<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sequential decision problems, dependently typed solutions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Botta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cezar Ionescu</string-name>
          <email>ionescug@pik-potsdam.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Brady</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Potsdam Institute for Climate Impact Research</institution>
          ,
          <addr-line>Telegrafenberg A31, 14473 Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of St Andrews</institution>
          ,
          <addr-line>KY16 9SX</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a dependently typed formalization for a simple class of sequential decision problems. For this class of problems, we implement a generic version of Bellman's backwards induction algorithm [2] and a machine checkable proof that the proposed implementation is correct. The formalization is generic. It is presented in Idris, but it can be easily translated to other dependently-typed programming languages. We conclude with an informal discussion of the problems we have faced in extending the formalization to generic monadic sequential decision problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper we formalize a simple class of sequential decision problems. For
this class, we implement a generic version of Bellman's backwards induction
algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and derive a machine checkable proof that the proposed
implementation is correct.
      </p>
      <p>
        The approach is similar in spirit to that proposed by de Moor [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], with the focus on generic programming. Both the formalization of the problem and its solution are based on an abstract context. In order to solve specific sequential decision problems, clients must provide problem-specific instances of the context. There are important differences between our approach and de Moor's, however. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Bellman's principle of optimality is rephrased in terms of three requirements: a promotability requirement for a feasibility predicate p and two monotonicity requirements for a list of decision functions fs (in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], both p and fs are part of the abstract context). These requirements are expressed as predicates in first-order logic. In our approach, Bellman's principle of optimality is stated in the same language used for implementing backwards induction. This allows us to derive a proof of correctness (of the implementation) which can be machine checked.
      </p>
      <p>
        An obvious consequence of our approach is that it requires a programming
language which is expressive enough to implement generic algorithms and to
encode (and prove) properties of such algorithms. Dependently typed languages
such as Idris [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Agda [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or Coq [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] support this by allowing types to be
predicated on values. A program's type encodes its meaning; allowing types to
be predicated on values means a programmer may give a precise meaning to a
program. Furthermore, generic programming is supported by allowing functions
to compute types directly. We have implemented our formalization in Idris but
a translation to Agda or other fully dependently-typed programming languages
would be straightforward.
      </p>
      <p>
        Another difference between our approach and the one described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is that we explicitly introduce a state space in our context and that the set of controls (decisions) which are available at a given step depends on the actual state at that step. In contrast, the set of controls in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is constant. As it turns out, a state-dependent notion of feasible controls is essential to generalize the formalization presented here to the case of time-dependent state spaces and non-deterministic transition functions. We do not present such a generalization here, but we discuss the problem of extending our formalization in section 7.1.
      </p>
      <sec id="sec-1-1">
        <title>Programming with Dependent Types</title>
        <p>
          We will use the Idris programming language3. Idris is a general-purpose language with dependent types. Its syntax, and many of its features such as type classes [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], are influenced by Haskell. Idris supports algebraic data types, such as natural numbers and polymorphic lists, declared as follows
data Nat = O | S Nat
data List a = Nil | (::) a (List a)
The canonical example of a dependent type is the type of lists indexed by length, usually called a vector. It is declared as follows, giving explicit types for the type constructor Vect as well as the constructors Nil and (::)
data Vect : Type -> Nat -> Type where
        </p>
        <sec id="sec-1-1-1">
          <title>Nil : Vect a O</title>
          <p>(::) : a -> Vect a k -> Vect a (S k)
Note that the constructors Nil and (::) have been overloaded for both Vect and List; overloading is resolved by type. A function over a dependent type encodes the invariant properties of its indices in its type, for example
(++) : Vect a n -> Vect a m -> Vect a (n + m)
(++) Nil ys = ys
(++) (x :: xs) ys = x :: xs ++ ys
A program with a dependent type can be understood as a proof of the logical interpretation of that type. In particular, types can be used to represent relations, such as equality
data (=) : a -> b -> Type where</p>
          <p>refl : x = x
3 idris-lang.org
Types are first class, meaning that they can be computed and manipulated like any other term. For example, this allows us to define type synonyms
IntVect : Int -> Type
IntVect n = Vect Int n</p>
          <p>
            We will make extensive use of the following type, so, which requires at type-checking time that a boolean value is provably true
data so : Bool -> Type where
  oh : so True
The only canonical instance of so x is oh, and this can only be constructed if x is True. Dependent types allow the value of an input to a function to be mentioned in the type of the output; that is, a function type is a binder. In this way, we can use so to guarantee that a required dynamic check has been made, for example
safe_divide : (x : Int) -> (y : Int) -> so (y /= 0) -> Int
safe_divide x y oh = x `div` y
A full tutorial for Idris is available elsewhere [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Sequential decision problems</title>
        <p>In textbooks on sequential decision problems and dynamic programming, the
notions of state and control spaces are usually not formalized.</p>
        <p>
          In Chapter 1 of [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], for instance, the notions of control sequence, policy,
policy sequence, optimal policy sequence, Bellman's principle of optimality and
backwards induction are discussed in the context of a discrete-time dynamical
system of the form
        </p>
        <p>
          xk+1 = fk(xk, uk, wk)
where xk, uk and wk represent the state of the system, the selected control and a random parameter at time k. The types of xk, uk and, a fortiori, the type of fk are not explicitly stated. In many examples, the type of xk turns out to be R^n or N^n. But one can easily imagine sequential decision problems in which the state of the system is represented by pairs, or perhaps by lists of pairs, as for example in the knapsack problem [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>What about controls? Is it safe to assume that the type of uk can be fixed, in specific sequential decision problems, independently of xk? In problems of resource allocation, one can often take controls to be real numbers on the unit interval. The interpretation is that uk represents the fraction of the available resource that is used for a certain purpose, for instance investment or saving.</p>
        <p>In general, however, the controls available in a given state do depend on that specific state. Of course, one can always embed the set of controls which are available in a given state in a larger set, thereby recovering the case of a constant control space. But then fk becomes a partial function and may fail to produce a new state for some specific state-control pairs.</p>
        <p>
          In the rest of this section we discuss a simple example of a sequential decision problem. It is an instance of the well-known "cylinder" problem originally proposed by [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and extensively studied in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We will use this problem to characterize SDPs and to exemplify and discuss our formalization.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>The "cylinder" problem</title>
        <p>Consider the following problem: a decision maker can be in one of five states: a, b, c, d or e. In a, the decision maker can choose between two options: move ahead (option A) or move to the right (option R). In b, c and d he can move to the left (L), ahead or to the right. In e he can move to the left or go ahead. Upon selecting an option, the decision maker enters a new state. For instance, selecting R in b brings him from b to c, see Figure 1, left. Thus, each step is characterized by a current state, a selected control and a new state. A step also yields a reward, for instance 3 for the transition from b to c and for option R, see Figure 1, middle. The challenge for the decision maker is to make a finite number of steps, say n, by selecting options that maximize the sum of the rewards collected. An example of a possible trajectory and corresponding rewards for the first four steps is shown on the right of Figure 1.</p>
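        <p>To make the example concrete, the cylinder context can be modelled in plain Python. This is an illustrative sketch only: the paper's formalization is in Idris, and all reward values other than the 3 collected for selecting R in b are invented here, since Figure 1 is not reproduced.

```python
# Python model of the cylinder example (illustrative sketch, not the
# paper's Idris code; rewards other than R-in-b are made-up values).
X = ["a", "b", "c", "d", "e"]          # the five states (columns)

def Y(x):
    """Controls feasible in x: no L in a, no R in e, as in the text."""
    if x == "a":
        return ["A", "R"]
    if x == "e":
        return ["L", "A"]
    return ["L", "A", "R"]

def step(x, y):
    """Deterministic transition: L and R move to the adjacent column,
    A keeps the current column while advancing one row."""
    i = X.index(x)
    if y == "L":
        return X[i - 1]
    if y == "R":
        return X[i + 1]
    return x

def reward(x, y, x_new):
    """Reward of a transition; Figure 1 shows 3 for selecting R in b."""
    return 3.0 if (x, y) == ("b", "R") else 1.0
```

A trajectory is then obtained by repeatedly applying step to the state returned by the previous step, summing the rewards along the way.</p>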
        <p>The cylinder problem is a very special example of a sequential decision problem. In this particular problem, the set of possible states (the state space) is finite and independent of the number of steps. The set of options available to the decision maker in a given state (the control space) is finite and only depends on that state and not, for instance, on the number of steps done.</p>
        <p>
          In most SDPs, these conditions are not met. In assembly line scheduling problems [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], for instance, certain items have to be moved, at each of a fixed number of steps, to one of a finite number of assembly lines. Each step implies a transfer cost (if the item is to change assembly line) and a production cost. The latter depends on the current step and assembly line. The idea is to select moves that minimize the overall costs of assembling a given item. While in assembly line problems the state and the control space are finite, the number of assembly lines available at a given step might depend on the number of steps already done (assembly lines might go down for maintenance every so many steps).
        </p>
        <p>
          In optimal investment problems, the state space, the control space or both might not be finite. The control space, for instance, might be the unit interval, a control being a real number representing, in a time-discrete setting, the fraction of the profit to be consumed at that step. Examples of SDPs are discussed, among others, in [
          <xref ref-type="bibr" rid="ref1 ref4 ref8">1, 4, 8</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-4">
        <title>Basic assumptions</title>
        <p>In this paper, we do not require the state and the control spaces to be finite4, but we do assume that the state space is independent of the number of steps and that the options available to the decision maker in a given state only depend on that state. Moreover, we assume that, at each step, the new state depends on the current state and on the selected control via a transition function and that the transition function is known to the decision maker. Similarly, we assume that the reward collected in a given step depends on the current state, on the selected control and on the new state through a reward function.</p>
        <p>These assumptions can be summarized in the notion of time-independent, deterministic sequential decision problems. This is the "simple class of sequential decision problems" considered in this paper.</p>
        <p>
          In contrast, general sequential decision problems are sequential decision problems in which the state and control spaces might be time-dependent and the outcome of a step can be a set of new states (non-deterministic sequential decision problems), a probability distribution of new states (stochastic sequential decision problems), or some other monadic structure of states, see [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Time-independent, deterministic sequential decision problems are related to but different from learning problems: reinforcement learning, supervised learning, etc. In learning problems, the transition and the reward functions (or their monadic counterparts) are not known or only partially known to the decision maker and have to be "learned".
4 But see section 6.</p>
        <sec id="sec-1-4-1">
          <title>X : Type</title>
          <p>To formalize the idea that the controls available in a specific state depend only on that state, we introduce a function Y
Y : X -> Type</p>
          <p>In section 7.1 we sketch the problems we have encountered in extending our formalization to time-dependent, monadic sequential decision problems. Such an extension will be discussed in detail in a follow-up paper.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Context</title>
      <p>We start by formalizing the context for time-independent, deterministic sequential decision problems. As anticipated in the introduction, we put very few constraints on the state space: we just require it to be a valid type
X : Type
The type Y x represents those controls (options, actions, etc.) which are available (feasible, admissible, etc.) in x. Of course, there are alternative ways to express the notion of state-specific admissible controls. For instance, one could explicitly introduce the set of all possible controls and represent admissibility in terms of a relation from X to that set.</p>
      <p>Next, we formalize the notion of a deterministic transition function: selecting a control in a state yields a well-defined new state</p>
      <p>step : (x : X) -> Y x -> X
The fourth and last element of our generic context formalizes the notion of a reward function. As discussed in the introduction, each state transition yields a reward depending on the state in which the transition occurs, on the control selected in that state (e.g., via control costs) and on the new state:
reward : (x : X) -> Y x -> X -> Float</p>
      <p>Throughout this paper we take rewards to be Floats, but this assumption can easily be relaxed. Before turning to the next section, a remark is in order: consider the type of step. The type of the first argument, X, is fixed, but the type of the second argument depends on the value of the first one. Without support for dependent types we would not be able to express such a type constraint precisely. In Haskell, for instance, we could have written</p>
      <p>step :: X -> Y0 -> X
But then step would not have been defined for all y : Y0, and it would have been difficult to express (let alone prove) properties of methods that rely on step.</p>
    </sec>
    <sec id="sec-3">
      <title>Control sequences</title>
      <p>We want to formalize the notion of a sequence of controls. For the generic context introduced in the previous section, the first element of a sequence of controls for x : X has to be a value of type Y x, say y. The second element has to be a value of type Y (step x y), say y'. The third element has to be a value of type Y (step (step x y) y'), and so on.</p>
      <p>It is clear that we cannot express the type of a control sequence in terms of lists or vectors of elements of a fixed type: we need dependently typed sequences of controls. These can be introduced by the type declaration:
data CtrlSeq : X -> Nat -> Type where</p>
      <sec id="sec-3-1">
        <title>Nil : CtrlSeq x O</title>
        <p>(::) : (y : Y x) -> CtrlSeq (step x y) n -> CtrlSeq x (S n)
As we explained in the introduction, the goal of the decision maker is to maximize the sum of the rewards collected in a given number of steps. Consequently, the value of (selecting controls according to) a control sequence for an initial state x : X and for a number of steps n is:
val : CtrlSeq x n -> Float
val Nil = 0
val {x} (y :: ys) = reward x y (step x y) + val ys
The notation {x} in the second clause allows us to bring the implicit argument x into scope, so that it can be used in the computation of the right-hand side.</p>
        <p>In general, different control sequences for the same initial value yield different sequences of rewards and have different values. We want to formalize the notion of optimal control sequences. The idea is that a control sequence ys for initial x and n steps is optimal if, for every control sequence ys' for the same x and n, the value of ys' is at most equal to the value of ys:</p>
        <p>OptCtrlSeq : CtrlSeq x n -> Type
OptCtrlSeq {x} {n} ys = (ys' : CtrlSeq x n) -> so (val ys' ≤ val ys)</p>
        <p>Thus, proving that a control sequence ys : CtrlSeq x n is optimal means implementing a function that computes a value of type so (val ys' ≤ val ys) for every ys' : CtrlSeq x n.</p>
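        <p>The quantification in OptCtrlSeq has a simple operational reading: for a finite instance such as the cylinder-style one, optimality of a control sequence can be checked by brute force, comparing its value against every feasible sequence of the same length. The following Python sketch illustrates this; the instance data (the reward table in particular) are illustrative assumptions, not taken from the paper.

```python
# Brute-force reading of OptCtrlSeq for a finite instance (illustrative).
X = ["a", "b", "c", "d", "e"]

def Y(x):
    return {"a": ["A", "R"], "e": ["L", "A"]}.get(x, ["L", "A", "R"])

def step(x, y):
    i = X.index(x)
    return {"L": X[i - 1], "R": X[(i + 1) % 5]}.get(y, x)

def reward(x, y, x_new):
    return 3.0 if (x, y) == ("b", "R") else 1.0

def val(x, ys):
    """Sum of rewards collected along the control sequence ys, from x."""
    total = 0.0
    for y in ys:
        x_new = step(x, y)
        total += reward(x, y, x_new)
        x = x_new
    return total

def all_ctrl_seqs(x, n):
    """All feasible control sequences of length n starting in x."""
    if n == 0:
        return [[]]
    return [[y] + ys
            for y in Y(x)
            for ys in all_ctrl_seqs(step(x, y), n - 1)]

def is_optimal(x, ys):
    """ys is optimal iff no feasible sequence of the same length beats it."""
    return all(val(x, ys) >= val(x, ys2) for ys2 in all_ctrl_seqs(x, len(ys)))
```

Note that is_optimal("a", []) holds vacuously, mirroring the proof below that the empty control sequence is optimal for every initial state.</p>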
        <p>It is easy to prove that the empty control sequence is optimal for every initial state. All we have to do is to show that, for every initial state x, the value of an arbitrary control sequence of length 0 is smaller than or equal to the value of Nil. This is certainly the case because, by definition of CtrlSeq, there are no control sequences of length 0 except for Nil:
nilIsOptCtrlSeq : (x : X) -> OptCtrlSeq {x} Nil
nilIsOptCtrlSeq x ys' = reflexive_Float_lte 0
In proving the property nilIsOptCtrlSeq we have applied reflexive_Float_lte to 0. reflexive_Float_lte is a function of type (x : Float) -> so (x ≤ x). For all (x : Float) it computes a value of type so (x ≤ x). For understanding nilIsOptCtrlSeq it is not important to know how reflexive_Float_lte has been implemented5. But it is important to notice that, when applied to 0, reflexive_Float_lte computes a value of type so (0 ≤ 0). The Idris type checker knows that, for all (x : X) and ys' : CtrlSeq x O, both val ys' and val Nil are equal to 0. Thus, the (type) value of so (val ys' ≤ val Nil) can be reduced to so (0 ≤ 0) and the type checker is fully satisfied by reflexive_Float_lte which, applied to 0, returns a value of exactly this type.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Policy sequences</title>
      <p>It is easy to compute sequences of controls if one has a rule for which control to select in every state. Such rules are called policies:</p>
      <sec id="sec-4-1">
        <title>Policy : Type</title>
        <p>Policy = (x : X) -> Y x
PolicySeq : Nat -> Type</p>
        <p>PolicySeq n = Vect Policy n
Sequences of policies can be represented by plain Idris vectors, as described in Section 1.1.</p>
        <p>Given an initial state and a policy sequence, the computation of the corresponding sequence of controls is straightforward:
ctrls : (x : X) -> PolicySeq n -> CtrlSeq x n
ctrls x Nil = Nil
ctrls x (p :: ps) = p x :: ctrls (step x (p x)) ps
The value of a policy sequence in terms of cumulated rewards is computed analogously to the value of a control sequence defined in section 3:</p>
        <p>Val : (x : X) -> PolicySeq n -> Float
Val {n = O} x Nil = 0
Val {n = S m} x (p :: ps) = reward x (p x) x' + Val x' ps where
  x' : X
  x' = step x (p x)
But the notion of optimality for policy sequences is somewhat stronger: we say that a policy sequence ps is optimal if, for every initial state x, its value is at least as good as the value of every other policy sequence (of the same length as ps) for that x. Formally:
5 In fact, for the sake of nilIsOptCtrlSeq, reflexive_Float_lte might just be a postulate.</p>
        <p>OptPolicySeq : (n : Nat) -> PolicySeq n -> Type
OptPolicySeq n ps = (x : X) ->
                    (ps' : PolicySeq n) ->
                    so (Val x ps' ≤ Val x ps)
This notion of optimality is relevant because of the following lemma:
OptLemma : (ReflDecEq X) =>
           (n : Nat) ->
           (ps : PolicySeq n) ->
           (ops : OptPolicySeq n ps) ->
           (x : X) -></p>
      </sec>
      <sec id="sec-4-2">
        <title>OptCtrlSeq (ctrls x ps)</title>
        <p>Let's consider OptLemma more closely. The type constraint ReflDecEq X requires equality on X to be decidable and reflexive. This is the first example, in our formalization, of a type constraint expressed in terms of a type class.</p>
        <p>OptLemma states that, for all X for which equality is decidable and reflexive, for all n : Nat and ps : PolicySeq n, if ps is an optimal policy sequence, then, for every initial value x, ctrls x ps is an optimal sequence of controls.</p>
        <p>The quantification on x in the notion of optimality for policy sequences is essential: without a notion of optimality "for a given x", we would not be able to prove OptLemma.</p>
        <p>Proving OptLemma means computing a value of type so (val ys' ≤ val (ctrls x ps)) given a policy sequence ps, a proof ops that ps is optimal, an arbitrary state x and an arbitrary control sequence ys'. The proof is based on the following lemma, using propositional equality:
valValLemma : (x : X) ->
              (ps : PolicySeq n) ->
              Val x ps = val (ctrls x ps)</p>
        <p>valValLemma says that the value of a policy sequence ps (as given by Val) is equal to the value of the corresponding sequence of controls (as given by val). This is not really surprising: a semi-formal proof of the val-Val equivalence can be easily derived by induction on n. We have to show
Val x ps = val (ctrls x ps)
The base case (n = O or, equivalently, ps = Nil) is trivial, since
Val x Nil = val (ctrls x Nil)
follows from
Val x Nil
= { def. of Val }
0
= { def. of val }
val Nil
= { def. of ctrls }
val (ctrls x Nil)
In the induction step, we have to show that
Val x (p :: ps) = val (ctrls x (p :: ps))
This can be readily done by applying the same definitions as above and the induction hypothesis:
Val x (p :: ps)
= { def. of Val with y = p x, x' = step x y }
reward x y x' + Val x' ps
= { induction hypothesis }
reward x y x' + val (ctrls x' ps)
= { def. of val }
val (y :: ctrls x' ps)
= { def. of y, x', ctrls }
val (ctrls x (p :: ps))
The Idris proof of OptLemma (not shown here) is a little more technical, but the idea is very simple:
1. construct ps' : PolicySeq n such that ctrls x ps' = ys'
2. apply the optimality of ps to deduce Val x ps' ≤ Val x ps</p>
      </sec>
      <sec id="sec-4-3">
        <title>3. apply valValLemma to deduce that val ys' ≤ val (ctrls x ps)</title>
        <p>OptLemma ensures that optimal control sequences can be obtained from optimal sequences of policies. This is particularly useful because optimal policy sequences can be easily computed using Bellman's backwards induction algorithm. Before turning to the formalization of Bellman's principle and to a generic implementation of backwards induction, let us recall a trivial but important result: just as in the case of empty sequences of controls, empty policy sequences are optimal:
nilIsOptPolicySeq : OptPolicySeq O Nil
nilIsOptPolicySeq x ps' = reflexive_Float_lte 0</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Bellman's optimality principle and backwards induction</title>
      <p>
        Bellman's principle of optimality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is based on the notion of optimal extensions of sequences of policies. Given a policy sequence ps, a policy p is an optimal extension of ps if the value of p :: ps is at least as good as the value of p' :: ps for an arbitrary policy p'. Formally:
      </p>
      <p>OptExt : PolicySeq n -> Policy -> Type
OptExt ps p = (p' : Policy) ->
              (x : X) ->
              so (Val x (p' :: ps) ≤ Val x (p :: ps))
Bellman's principle of optimality says that if we are given a policy sequence ps of length n, a proof of optimality for ps and an optimal extension p of ps, then we can construct an optimal policy sequence of length n + 1 simply by consing p with ps:</p>
      <p>Bellman : (ps : PolicySeq n) -></p>
      <p>OptPolicySeq n ps ->
(p : Policy) ->
OptExt ps p -></p>
      <sec id="sec-5-1">
        <title>OptPolicySeq (S n) (p :: ps)</title>
        <p>To prove Bellman, one has to show that, if ps : PolicySeq n is optimal and p is an optimal extension of ps, then p :: ps is optimal, that is</p>
        <p>Val x (p' :: ps') ≤ Val x (p :: ps)
for arbitrary x : X and (p' :: ps') : PolicySeq (S n). A semi-formal proof is straightforward:</p>
        <p>Val x (p' :: ps')
= { def. of Val with x' = step x (p' x) }
reward x (p' x) x' + Val x' ps'
≤ { optimality of ps, monotonicity of + }
reward x (p' x) x' + Val x' ps
= { def. of Val }
Val x (p' :: ps)
≤ { p is an optimal extension of ps }
Val x (p :: ps)
We can easily turn the semi-formal proof into a program and ask Idris to type check Bellman's principle.</p>
        <p>Let's assume that we can implement a function optExt : PolicySeq n -> Policy that computes optimal extensions, that is, optExt fulfills</p>
        <p>OptExtLemma : (ps : PolicySeq n) -> OptExt ps (optExt ps)
The specification says that, for all ps, optExt ps is an optimal extension of ps. Recall that empty policy sequences are optimal: nilIsOptPolicySeq. Thus, one can use optExt to compute optimal policy sequences backwards (Bellman, 1957):
backwardsInduction : (n : Nat) -> PolicySeq n
backwardsInduction O = Nil
backwardsInduction (S n) = (optExt ps) :: ps where
  ps : PolicySeq n
  ps = backwardsInduction n
This implementation of backwards induction is not particularly efficient6. But it can easily be proved to be correct, that is, to fulfill the specification:
BackwardsInductionLemma : (n : Nat) -></p>
      </sec>
      <sec id="sec-5-2">
        <title>OptPolicySeq n (backwardsInduction n)</title>
        <p>For n = O, BackwardsInductionLemma certainly holds because empty policy sequences are optimal:</p>
        <p>BackwardsInductionLemma O = nilIsOptPolicySeq
The induction step can be proved using Bellman's principle of optimality. All one has to do is to construct the four arguments needed to apply Bellman:
BackwardsInductionLemma (S m) =</p>
        <p>Bellman ps ops p oep where
  ps : PolicySeq m
  ps = backwardsInduction m
  ops : OptPolicySeq m ps
  ops = BackwardsInductionLemma m
  p : Policy
  p = optExt ps
  oep : OptExt ps p
  oep = OptExtLemma ps
Notice the usage of the induction hypothesis BackwardsInductionLemma m in the construction of ops, the proof that ps is optimal.
6 As it turns out, backwardsInduction executes in exponential time in n. If the state space X is finite, it is easy to derive an implementation of backwards induction which executes in linear time in n.</p>
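        <p>For a finite instance, the backwardsInduction scheme above can be mirrored in Python, with policies represented as dictionaries from states to controls and the optimal extension computed by enumeration. This is a sketch under stated assumptions: the cylinder-style instance (states, feasible controls, rewards) is illustrative, not the paper's.

```python
# Python mirror of backwardsInduction, with optExt by enumeration
# (illustrative instance; a policy is a dict mapping each state to a control).
X = ["a", "b", "c", "d", "e"]

def Y(x):
    return {"a": ["A", "R"], "e": ["L", "A"]}.get(x, ["L", "A", "R"])

def step(x, y):
    i = X.index(x)
    return {"L": X[i - 1], "R": X[(i + 1) % 5]}.get(y, x)

def reward(x, y, x_new):
    return 3.0 if (x, y) == ("b", "R") else 1.0

def Val(x, ps):
    """Value of the policy sequence ps starting in x (cf. Val, section 4)."""
    if not ps:
        return 0.0
    p, rest = ps[0], ps[1:]
    x_new = step(x, p[x])
    return reward(x, p[x], x_new) + Val(x_new, rest)

def opt_ext(ps):
    """Optimal extension of ps: in each state, pick the control that
    maximizes the immediate reward plus the value of the tail ps."""
    return {x: max(Y(x),
                   key=lambda y: reward(x, y, step(x, y)) + Val(step(x, y), ps))
            for x in X}

def backwards_induction(n):
    """Build a policy sequence of length n, back to front."""
    ps = []
    for _ in range(n):
        ps = [opt_ext(ps)] + ps
    return ps
```

Like the Idris version, this sketch recomputes Val of the tail at every step and is therefore exponential in n; memoizing Val per (state, remaining steps) gives the linear-time variant mentioned in the footnote.</p>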
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Optimal extensions of policy sequences</title>
      <p>It is easy to implement a function that computes the optimal extension of a policy sequence ps if one can find, for every x : X, an optimal control. This is a y : Y x such that, for all y' : Y x,
reward x y' (step x y') + Val (step x y') ps
≤
reward x y (step x y) + Val (step x y) ps
If Y x is finite, it is possible to find an optimal control with a finite number of comparisons. In general, we can compute an optimal extension of a policy sequence if we have functions
max : (x : X) -> (Y x -> Float) -> Float
argmax : (x : X) -> (Y x -> Float) -> Y x
which fulfill the specifications</p>
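      <p>When Y x is finite, as noted above, max and argmax can be implemented by a finite number of comparisons. The following Python sketch shows one such enumeration-based implementation; the definition of Y is the illustrative cylinder-style one, not the paper's, and the function names are our own.

```python
# Enumeration-based max and argmax over a finite control space:
# one way to satisfy MaxSpec and ArgmaxSpec (illustrative Y).
def Y(x):
    return {"a": ["A", "R"], "e": ["L", "A"]}.get(x, ["L", "A", "R"])

def max_ctrl(x, f):
    """max x f: the largest value of f over the finite set Y x."""
    return max(f(y) for y in Y(x))

def argmax_ctrl(x, f):
    """argmax x f: a control attaining that maximum, so that
    f(argmax_ctrl(x, f)) == max_ctrl(x, f), as ArgmaxSpec requires."""
    return max(Y(x), key=f)
```

MaxSpec holds because the maximum over a finite list dominates every element; ArgmaxSpec holds because max with a key function returns an element of the list attaining that maximum.</p>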
      <sec id="sec-6-1">
        <title>MaxSpec : Type</title>
        <p>MaxSpec = (x : X) ->
          (f : Y x -> Float) ->
          (y : Y x) ->
          so (f y ≤ max x f)</p>
      </sec>
      <sec id="sec-6-2">
        <title>ArgmaxSpec : Type</title>
        <p>ArgmaxSpec = (x : X) ->
          (f : Y x -> Float) ->
          so (f (argmax x f) == max x f)</p>
        <p>In this case
optExt : PolicySeq n -> Policy
optExt ps x = argmax x f where
  f : Y x -> Float
  f y = reward x y x' + Val x' ps where
    x' : X
    x' = step x y
fulfills OptExtLemma. This can be seen easily by equational reasoning. We have to show that</p>
        <p>Val x (p' :: ps) ≤ Val x ((optExt ps) :: ps)
for arbitrary p' : Policy. We have:</p>
        <p>Val x (p' :: ps)
= { def. of Val with x' = step x (p' x) }
reward x (p' x) x' + Val x' ps
= { def. of f }
f (p' x)
≤ { MaxSpec }
max x f
= { ArgmaxSpec }
f (argmax x f)
= { def. of f, Val and optExt }
Val x ((optExt ps) :: ps)</p>
        <p>As for the case of Bellman and BackwardsInductionLemma, it is straightforward to rewrite this proof in Idris and machine check its correctness.</p>
        <p>The existence of max and argmax is a condition which is simple to formulate and sufficient for computing optimal extensions, but it is also stronger than necessary: it requires computing the maximum of an arbitrary function of type Y x -> Float, which is in many cases intractable. In the optimal extension algorithm, however, we are only interested in maximizing functions of the specific form determined by reward, Val and step. Weaker conditions can be formulated by taking this into account and restricting the class of functions for which max and argmax need to be implemented. For example, if reward and step are linear, so are Val and f, and we could replace the general optimization required by max and argmax with the simpler problem of linear optimization.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Summary</title>
      <p>For time-independent, deterministic sequential decision problems, we have
formalized the notions of state space, available controls, transition and reward
functions (the context), optimal control and policy sequences, and Bellman's
principle of optimality. Provided we are given a function optExt which fulfills
OptExtLemma, we have derived a generic implementation of backwards
induction and proved that this implementation is correct, that is, that it computes
policy sequences which are optimal. From an optimal policy sequence, an
optimal control sequence can be recovered for arbitrary initial states through ctrls.
In the last section we have specified sufficient conditions for optExt to be
implementable, and under these conditions we have derived a generic implementation of optExt.</p>
      <sec id="sec-7-1">
        <title>Outlook</title>
        <p>Our final aim is to extend the formalization presented in this article to
sequential decision problems which can be time-dependent and for which step can be
deterministic, non-deterministic, stochastic or, more generally, monadic.</p>
        <p>
          We present such an extension in detail in a follow-up paper. Here, we just hint at
the two difficulties that have to be faced in deriving such a generic extension. A
first difficulty is that one has to find a way of extending the formalization to the
case in which the target of step is a generic M-structure on X, for a monad M.
Obviously, a monadic step function naturally induces an M-structure on Float
via the reward function. As it turns out, one can apply the ideas developed in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
for monadic dynamical systems to derive a natural and fully generic approach
for iterating monadic step functions and to account for M-structures of rewards.
        </p>
        <p>Another difficulty is that when X, and therefore Y, depend on the number
of steps, one cannot assume, in general, that admissible controls exist for every
state. To see why this is the case, let us turn back to the cylinder problem and
consider a case in which the state space X is not constant, but depends on the
number of steps t : Nat. For concreteness, consider a case in which all five columns
(a, b, c, d and e) are valid for t &lt; 3 and t &gt; 3 but, at t = 3, only column e is
a valid state; see Figure 2, left.</p>
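        <p>In a dependently typed setting, such a time-dependent state space is naturally
represented as a family of types indexed by the number of steps. A hedged sketch
for the cylinder example, in which the names Column and valid are ours:</p>
        <p>data Column = A | B | C | D | E

-- valid t c: column c is a valid state at time t
valid : Nat -&gt; Column -&gt; Bool
valid 3 E = True
valid 3 _ = False
valid _ _ = True

-- the state space at time t: columns paired with a proof of validity
X : Nat -&gt; Type
X t = (c : Column ** So (valid t c))</p>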
        <p>[Figure 2: the reduced state space with only column e valid at t = 3 (left),
the resulting unreachable states (middle) and the non-viable states (right).]</p>
        <p>The consequences of a reduced state space at time 3 are twofold. First, certain
states at t &gt; 3 become unreachable. These are the states which are "shadowed"
by the invalid columns at time 3. They are represented in light grey in the middle
of Figure 2.</p>
        <p>Second, certain states become non-viable: these are states at t &lt; 3 from
which we can take at most two steps. They are represented, also in light grey,
on the right of Figure 2. Not accounting for unreachable states can be inefficient,
because optimal extensions are computed for states which effectively cannot be
realized by any sequence of controls. This is particularly true if the set of options
associated with such unreachable states is big.</p>
        <p>Viability, however, has deeper implications. In Figure 2, right, the initial state
a is essentially different from b, c, d or e: there is simply no trajectory of length
&gt;2 starting in a. Similarly, a and b cannot be initial states of trajectories of
length &gt;1 starting at t = 1, and so on.</p>
        <p>Thus, extending the generic framework presented in this paper to the
time-dependent case requires, on the one hand, formalizing the notion of viability
generically. This becomes particularly interesting in the general case of monadic
step functions. On the other hand, it requires accounting for viability constraints
both in the domain of policy functions and in their range, the set of controls
that can be selected at a given time. In computing optimal extensions for policy
sequences of length n at a time t, only admissible controls leading to states from
which (at least) n - t further steps can be done are suitable candidates. Controls
leading to "dead-ends" have to be carefully identified and rejected.</p>
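        <p>One way to make this precise is a viability predicate on time-indexed states.
A hedged sketch, assuming a time-dependent step : (t : Nat) -&gt; (x : X t) -&gt; Y t x -&gt; X (S t)
and an enumeration enumY of the admissible controls; both names are our assumptions:</p>
        <p>-- viable n t x: at least n further steps can be taken from x at time t
viable : (n : Nat) -&gt; (t : Nat) -&gt; X t -&gt; Bool
viable Z     _ _ = True
viable (S m) t x = any (\ y =&gt; viable m (S t) (step t x y)) (enumY t x)</p>
        <p>Policies for n remaining steps would then select, at time t, only those controls y
for which the successor state step t x y is viable for the remaining n - 1 steps.</p>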
      </sec>
      <sec id="sec-7-2">
        <title>Acknowledgments</title>
        <p>The work presented in this paper heavily relies on free software, among others
on Hugs, GHC, vi, the GCC compiler, Emacs, LaTeX and on the FreeBSD and
Debian GNU/Linux operating systems. It is our pleasure to thank all developers
of these excellent products. We also thank the anonymous referees for their
helpful comments.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. J. C. H.</given-names>
            <surname>Watkins</surname>
          </string-name>
          .
          <article-title>Learning and sequential decision making</article-title>
          .
          <source>Technical report, COINS Technical Report No. 89-95</source>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Bellman</surname>
          </string-name>
          .
          <source>Dynamic Programming</source>
          . Princeton University Press,
          <year>1957</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bertot</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Castéran</surname>
          </string-name>
          .
          <article-title>Interactive theorem proving and program development, Coq'art: the calculus of inductive constructions</article-title>
          .
          <source>Texts in Theoretical Computer Science</source>
          . Springer,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Bertsekas</surname>
          </string-name>
          .
          <source>Dynamic Programming and Optimal Control</source>
          . Athena Scientific, Belmont, Mass.,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Richard</given-names>
            <surname>Bird</surname>
          </string-name>
          and Oege de Moor.
          <source>Algebra of Programming</source>
          . International Series in Computer Science. Prentice Hall, Hemel Hempstead,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Edwin</given-names>
            <surname>Brady</surname>
          </string-name>
          .
          <source>Programming in Idris: a tutorial</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>O.</given-names>
            <surname>de Moor</surname>
          </string-name>
          .
          <article-title>A generic program for sequential decision processes</article-title>
          . In
          <source>PLILPS '95: Proceedings of the 7th International Symposium on Programming Languages: Implementations, Logics and Programs</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          . Springer,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Cormen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Rivest</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Leiserson</surname>
          </string-name>
          .
          <source>Introduction to Algorithms</source>
          . McGraw-Hill, second edition,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Cezar</given-names>
            <surname>Ionescu</surname>
          </string-name>
          .
          <article-title>Vulnerability Modelling and Monadic Dynamical Systems</article-title>
          .
          <source>PhD thesis</source>
          , Freie Universitat Berlin,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Ulf</given-names>
            <surname>Norell</surname>
          </string-name>
          .
          <article-title>Towards a practical programming language based on dependent type theory</article-title>
          .
          <source>PhD thesis</source>
          , Chalmers University of Technology,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Reingold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nievergelt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Deo</surname>
          </string-name>
          .
          <source>Combinatorial Algorithms: Theory and Practice</source>
          . Prentice Hall,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Matthieu</given-names>
            <surname>Sozeau</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nicolas</given-names>
            <surname>Oury</surname>
          </string-name>
          .
          <article-title>First-Class Type Classes</article-title>
          .
          In
          <source>Theorem Proving in Higher Order Logics (TPHOLs 2008)</source>
          , pages
          <fpage>278</fpage>
          -
          <lpage>293</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>