<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Markov Approximation of Zero-sum Differential Games</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yurii Averboukh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Krasovskii Institute of Mathematics and Mechanics</institution>
          ,
          <addr-line>S. Kovalevskoy str. 16, 620990 Yekaterinburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>73</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>The paper is concerned with approximations of the value function of a differential game. To this end we introduce an approximation of the original differential game by a continuous-time Markov game with an infinite state space. The value function of this game in a given region is approximated by a solution of a finite state Markov game. This yields an approximation of the value function of the original differential game by a finite system of ODEs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The results are close to the approximation of control systems by Markov chains considered in
[Boue &amp; Dupuis, 1999], [Kushner &amp; Dupuis, 2001].</p>
      <p>Paper [Averboukh, 2016] is concerned with general continuous-time stochastic games. A near-optimal
strategy for the continuous-time stochastic game is constructed using the value function of a game with different
dynamics. The general theory implies the approximation of the value function of the differential game by an
infinite system of ODEs. This system is obtained using dynamic programming arguments for the approximating
continuous-time Markov game.</p>
      <p>In this paper we introduce an approximation of the value function of the differential game by a finite system
of ODEs. This approach can be used for numerical schemes. The paper is organized as follows. In Section
2 we recall the general theory of differential games. Section 3 deals with the approximation of the original
differential game by a Markov game with an infinite state space. This leads to the approximation of the value
function by an infinite system of ODEs. The approximation of the differential game by a Markov game with a
finite state space is considered in Section 4. In Section 4 we also obtain an approximation of the value function of
the differential game by a finite system of ODEs.</p>
      <p>2 Differential Games</p>
      <p>We consider the differential game with the dynamics
(d/dt)x(t) = f(t, x(t), u(t), v(t)), t ∈ [0, T], x(t) ∈ Rd, u(t) ∈ U, v(t) ∈ V.
(1)
Here u(t) (respectively, v(t)) is an instantaneous control of the first (respectively, second) player. The aim of the
first (respectively, second) player is to minimize (respectively, maximize) the terminal payoff σ(x(T)).</p>
      <p>We use the feedback approach proposed by Krasovskii and Subbotin. To introduce this formalization, let us denote
by U[t0] (respectively, V[t0]) the set of all measurable functions from [t0, T] to U (respectively, V).</p>
      <p>We assume that:
1. the sets U and V are metric compacts;
2. the functions f and σ are continuous;
3. the functions f and σ are bounded by a constant M;
4. there exists a constant K such that, for any t ∈ [0, T], x′, x′′ ∈ Rd, u ∈ U, v ∈ V,
∥f(t, x′, u, v) − f(t, x′′, u, v)∥ ≤ K∥x′ − x′′∥;
5. there exists a function α : R → R such that α(δ) → 0 as δ → 0 and, for any t′, t′′ ∈ [0, T], x ∈ Rd, u ∈ U, v ∈ V,
∥f(t′, x, u, v) − f(t′′, x, u, v)∥ ≤ α(t′ − t′′);</p>
    </sec>
    <sec id="sec-2">
      <title>6. the function σ is Lipschitz continuous with the constant R;</title>
      <p>7. (Isaacs’ condition) for any t ∈ [0, T], x, ξ ∈ Rd,
min_{u∈U} max_{v∈V} ⟨ξ, f(t, x, u, v)⟩ = max_{v∈V} min_{u∈U} ⟨ξ, f(t, x, u, v)⟩.</p>
      <p>Below, let p be an arbitrary function from [0, T] × Rd to U. We will construct a stepwise control that is
constant between the times of control correction. Let t0 be an initial time, and let ∆ = {ti}_{i=0}^N be a partition of [t0, T].
The times ti are the times of control correction. If x0 ∈ Rd, v ∈ V[t0], then let x1[·, t0, x0, p, ∆, v] be a solution of
the problem</p>
      <p>(d/dt)x(t) = f(t, x(t), p(ti−1, x(ti−1)), v(t)), t ∈ [ti−1, ti], i = 1, . . . , N, x(t0) = x0.</p>
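<p>The stepwise motions are straightforward to simulate. The following Python sketch (with a made-up dynamics f, feedback p, and open-loop control v of the second player; none of these come from the paper) integrates the first player's stepwise motion with Euler steps, freezing the feedback control between the correction times of ∆:</p>

```python
import numpy as np

# Hypothetical data for illustration (not from the paper):
f = lambda t, x, u, v: np.array([x[1], u + v])   # dynamics on R^2
p = lambda t, x: -np.sign(x[0])                  # feedback of the first player
v = lambda t: 0.1                                # open-loop control of the second player

def stepwise_motion(t0, x0, p, grid, v, substeps=50):
    """Euler integration of x' = f(t, x, p(t_i, x(t_i)), v(t)), where the
    first player's control is frozen between the correction times of the grid."""
    x = np.array(x0, dtype=float)
    for ti, tj in zip(grid[:-1], grid[1:]):
        u_frozen = p(ti, x)                      # corrected only at t_i
        dt = (tj - ti) / substeps
        for k in range(substeps):
            x = x + dt * f(ti + k * dt, x, u_frozen, v(ti + k * dt))
    return x

grid = np.linspace(0.0, 1.0, 11)                 # partition Delta of [t0, T] = [0, 1]
xT = stepwise_motion(0.0, [1.0, 0.0], p, grid, v)
```

Refining the partition and optimizing over feedbacks p as in the Krasovskii-Subbotin equalities below yields the guaranteed result of the first player.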
      <p>Analogously, the strategy of the second player is determined by a function q : [0, T] × Rd → V. If (t0, x0) is
an initial position, ∆ = {ti}_{i=0}^N is a partition of [t0, T], and u ∈ U[t0], then denote by x2[·, t0, x0, q, ∆, u] a solution of
the problem
(d/dt)x(t) = f(t, x(t), u(t), q(ti−1, x(ti−1))), t ∈ [ti−1, ti], i = 1, . . . , N, x(t0) = x0.</p>
      <p>Krasovskii and Subbotin proved that there exist functions p∗ : [0, T] × Rd → U, q∗ : [0, T] × Rd → V such that
lim_{δ↓0} sup{σ(x1[T, t0, x0, p∗, ∆, v]) : d(∆) ≤ δ, v ∈ V[t0]}
= inf_{p ∈ U^{[0,T]×Rd}} lim_{δ↓0} sup{σ(x1[T, t0, x0, p, ∆, v]) : d(∆) ≤ δ, v ∈ V[t0]}
= sup_{q ∈ V^{[0,T]×Rd}} lim_{δ↓0} inf{σ(x2[T, t0, x0, q, ∆, u]) : d(∆) ≤ δ, u ∈ U[t0]}
= lim_{δ↓0} inf{σ(x2[T, t0, x0, q∗, ∆, u]) : d(∆) ≤ δ, u ∈ U[t0]} = Val(t0, x0).</p>
      <sec id="sec-2-1">
        <title>Here BA stands for the set of functions from A to B.</title>
        <p>The function Val is the value function of the differential game. It is a viscosity/minimax solution of the
Hamilton-Jacobi PDE</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Hamilton-Jacobi PDE</title>
      <p>∂ϕ/∂t + H(t, x, ∇ϕ) = 0, ϕ(T, x) = σ(x).
(2)</p>
    </sec>
    <sec id="sec-4">
      <title>Here</title>
      <p>H(t, x, ξ) ≜ min_{u∈U} max_{v∈V} ⟨ξ, f(t, x, u, v)⟩.</p>
      <p>Note that given a supersolution of (2) one can construct a suboptimal strategy of the first player. An analogous
result holds true for the second player; she is to use a subsolution of (2).</p>
      <p>3 General Theory of Markov Games</p>
      <p>To evaluate the value function we will use a zero-sum Markov game. In the general case a Markov game is defined
in the following way. Let S be a set that is at most countable. Below S stands for the state space. For
any t ∈ [0, T], u ∈ U, v ∈ V, let Q(t, u, v) = (Qx,y(t, u, v))x,y∈S be a Kolmogorov matrix, i.e., for any t ∈ [0, T],
u ∈ U, v ∈ V, x ∈ S,
∑_{y∈S} Qx,y(t, u, v) = 0,
and, for any y ∈ S, y ̸= x,
Qx,y(t, u, v) ≥ 0.</p>
      <p>If u ∈ U[t0], v ∈ V[t0] are open-loop controls of the players and the state of the Markov chain at t ∈ [0, T] is x, then
under some continuity conditions the probability of the state y ̸= x at time t+ &gt; t is
∫_{t}^{t+} Qx,y(τ, u(τ), v(τ))dτ + o(t+ − t),
whereas the probability of staying at x on [t, t+] is
1 + ∫_{t}^{t+} Qx,x(τ, u(τ), v(τ))dτ + o(t+ − t).</p>
      <p>Recall that Qx,x(t, u, v) is always nonpositive. If we denote the probability of the state x at time t by Px(t), we
obtain the vector P(t) = (Px(t))x∈S. Below we assume that P(t) is a row-vector. The dynamics of the vector of
probabilities P(t) is given by the Kolmogorov equation</p>
      <p>(d/dt)P(t) = P(t)Q(t, u(t), v(t)), P(0) = P0.
(3)
Here P0 = (Px0)x∈S is an initial distribution.</p>
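<p>For a finite state space, equation (3) is a linear ODE system that can be integrated directly. A minimal Python sketch, assuming a toy two-state chain with constant controls (the matrix below is illustrative, not from the paper):</p>

```python
import numpy as np

# Toy Kolmogorov matrix for a two-state chain with constant controls:
# rows sum to zero, off-diagonal entries are nonnegative.
Q = np.array([[-1.0,  1.0],
              [ 2.0, -2.0]])

def evolve(P0, Q, T=1.0, steps=1000):
    """Forward Euler for the Kolmogorov equation dP/dt = P Q; P is a row vector."""
    P = np.array(P0, dtype=float)
    dt = T / steps
    for _ in range(steps):
        P = P + dt * (P @ Q)
    return P

P = evolve([1.0, 0.0], Q)
# P remains a probability vector: since the rows of Q sum to zero,
# the Euler step preserves the total mass exactly.
```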
      <p>Assume that the first (respectively, second) player wishes to minimize (respectively, maximize) the terminal
payoff given by Eσ(X(T)), where X(t) is the state of the Markov chain corresponding to the Kolmogorov matrix
Q(t, u(t), v(t)) at time t.</p>
      <p>We do not impose any analog of Isaacs’ condition on the Markov game. Thus, we have to suppose that one player
is informed about the current control of her partner, or use mixed strategies. Within the paper we assume that
the first player has information only on the current position, whereas the second player is informed about the
current position and the current control of the first player. In this case the strategy of the first player is u(t, x), and the
strategy of the second player is v(t, x, u). These strategies produce a Markov chain with the Kolmogorov matrix</p>
      <p>Q̂x,y(t) ≜ Qx,y(t, u(t, x), v(t, x, u(t, x))).</p>
      <p>Denote the corresponding Markov chain starting at (t∗, x∗) by Xt∗,x∗,u,v(·). The corresponding probability is
denoted by P t∗,x∗,u,v. One may introduce the upper value function by the rule:
η+(t∗, x∗) ≜ min_u max_v Et∗,x∗,u,vσ(Xt∗,x∗,u,v(T)).</p>
      <sec id="sec-4-1">
        <title>Here Et∗,x∗,u,v stands for the expectation according to the probability P t∗,x∗,u,v.</title>
        <p>Under some regularity conditions the value function satisfies the Zachrisson equation, which is an analog of the
Isaacs-Bellman equation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Isaacs-Bellman equation:</title>
      <p>(d/dt)η+(t, x) + min_{u∈U} max_{v∈V} ∑_{y∈S} η+(t, y)Qx,y(t, u, v) = 0, η+(T, x) = σ(x), x ∈ S.
(4)</p>
      <p>Note that (4) is a system of ODEs. The existence result was established for the case of a finite S in [Zachrisson, 1964].</p>
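<p>For a finite state space and finite control sets, the system (4) can be integrated backward from the terminal condition, taking the min/max by enumeration at each step. A toy Python sketch (the Kolmogorov matrix, control sets, and payoff below are invented for illustration):</p>

```python
import numpy as np

# Toy data for illustration (not from the paper).
S = [0, 1, 2]                          # finite state space
U = [0, 1]                             # finite control set of the first player
V = [0, 1]                             # finite control set of the second player
sigma = np.array([0.0, 1.0, 2.0])      # terminal payoff

def Qmat(u, v):
    """Hypothetical Kolmogorov matrix: rows sum to 0, off-diagonal >= 0."""
    a, b = 1.0 + u, 1.0 + v
    return np.array([[ -a,      a, 0.0],
                     [  b, -a - b,   a],
                     [0.0,      b,  -b]])

T, steps = 1.0, 2000
dt = T / steps
eta = sigma.copy()                     # terminal condition eta(T, x) = sigma(x)
for _ in range(steps):                 # step backward in time from T to 0
    rhs = np.array([
        min(max(Qmat(u, v)[x] @ eta for v in V) for u in U)
        for x in S])
    eta = eta + dt * rhs               # d/dt eta = -min max (...), reversed in time
# eta now approximates the upper value eta^+(0, x) for each x in S
```

Since each Euler step is a convex combination of the current values, the computed values stay within the range of σ.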
      <p>If a solution of (4) exists, one can define the strategies of the players as follows:
u∗(t, x) ≜ argmin_{u∈U} max_{v∈V} ∑_{y∈S} η+(t, y)Qx,y(t, u, v);
v∗(t, x, u) ≜ argmax_{v∈V} ∑_{y∈S} η+(t, y)Qx,y(t, u, v).</p>
      <p>Using the standard verification arguments, one can prove that the strategies u∗ and v∗ are optimal, i.e.,
η+(t∗, x∗) = Et∗,x∗,u∗,v∗σ(X(T, t∗, x∗, u∗, v∗)) = min_u Et∗,x∗,u,v∗σ(X(T, t∗, x∗, u, v∗))
= max_v Et∗,x∗,u∗,vσ(X(T, t∗, x∗, u∗, v)) = min_u max_v Et∗,x∗,u,vσ(X(T, t∗, x∗, u, v)).
(5)</p>
      <p>4 Approximating Infinite State Markov Chain</p>
      <p>Given a differential game with dynamics (1), define the approximating Markov game in the following way.</p>
      <p>Let h be a positive number, let f(t, x, u, v) = (f1(t, x, u, v), . . . , fd(t, x, u, v)), and let ei denote the i-th coordinate
vector. Put
χi(t, x, u, v) = ei if fi(t, x, u, v) &gt; 0; −ei if fi(t, x, u, v) &lt; 0; 0 if fi(t, x, u, v) = 0.</p>
      <sec id="sec-5-1">
        <title>Let the state space be S ≜ hZd. Define the Kolmogorov matrix by</title>
        <p>Qhx,y(t, u, v) = (1/h)|fi(t, x, u, v)| if y = x + hχi(t, x, u, v);
−(1/h)∑_{i=1}^{d} |fi(t, x, u, v)| if y = x;
0 if y ̸= x and y ̸= x + hχi(t, x, u, v) for all i.
(6)</p>
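<p>In words, from a state x ∈ hZd the chain jumps to the neighbor x + hχi(t, x, u, v) with rate |fi(t, x, u, v)|/h. A small Python sketch of these transition rates (the dynamics f below is hypothetical):</p>

```python
import numpy as np

def rates(f, t, x, u, v, h):
    """Return (target_state, rate) pairs of the approximating chain Q^h at (t, x):
    jump x -> x + h*chi_i with rate |f_i(t, x, u, v)| / h, i = 1, ..., d."""
    fx = f(t, x, u, v)
    out = []
    for i, fi in enumerate(fx):
        if fi == 0.0:
            continue                      # chi_i = 0, no jump in direction i
        y = x.copy()
        y[i] += h if fi > 0 else -h       # chi_i = +e_i or -e_i by the sign of f_i
        out.append((y, abs(fi) / h))
    return out

# hypothetical dynamics: a controlled rotation field on R^2
f = lambda t, x, u, v: np.array([-x[1] + u, x[0] + v])
jumps = rates(f, 0.0, np.array([1.0, 0.0]), 0.0, 0.0, h=0.1)
# the total outflow rate equals (1/h) * sum_i |f_i|, matching -Q^h_{x,x} in (6)
```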
      </sec>
    </sec>
    <sec id="sec-6">
      <title>In this case the Zachrisson equation (4) takes the form:</title>
      <p>(d/dt)ηh+(t, x) + min_{u∈U} max_{v∈V} ∑_{i=1}^{d} |fi(t, x, u, v)| (ηh+(t, x + hχi(t, x, u, v)) − ηh+(t, x))/h = 0, ηh+(T, x) = σ(x).
(7)</p>
      <p>Here x ∈ hZd is a parameter.</p>
    </sec>
    <sec id="sec-7">
      <title>The following statement is proved in [Averboukh, 2016].</title>
      <p>Proposition 1. There exists a unique solution of (7).</p>
    </sec>
    <sec id="sec-8">
      <title>The following theorem is also proved in [Averboukh, 2016].</title>
      <p>Theorem 1. There exists a constant C1 determined by the function f such that if ηh+ is a solution of (7), then,
for t0 ∈ [0, T], x0 ∈ hZd,
|Val(t0, x0) − ηh+(t0, x0)| ≤ RC1√h.
(8)
Note that in [Averboukh, 2016] it is proved that C1 = √(2dMT)e^{(3+2K)T}.</p>
      <p>Further, the optimal strategies of the players take the form:
u∗(t, x) ≜ argmin_{u∈U} max_{v∈V} ∑_{i=1}^{d} |fi(t, x, u, v)| [ηh+(t, x + hχi(t, x, u, v)) − ηh+(t, x)]/h;
v∗(t, x, u) ≜ argmax_{v∈V} ∑_{i=1}^{d} |fi(t, x, u, v)| [ηh+(t, x + hχi(t, x, u, v)) − ηh+(t, x)]/h.</p>
      <p>5 Approximating Finite State Markov Chain</p>
      <p>Assume that we are interested in the approximation of the value function in the set</p>
    </sec>
    <sec id="sec-9">
      <title>To this end we consider the Markov chain with the state space Λρ. Let</title>
      <p>Λr ≜ {x ∈ hZd : max_{i=1,...,d} |xi| ≤ r},
Λρ ≜ {x ∈ hZd : max_{i=1,...,d} |xi| ≤ ρ},
Γρ ≜ {x ∈ hZd : max_{i=1,...,d} |xi| = ρ}.</p>
      <p>Define the Kolmogorov matrix Q♮,h(t, u, v) = (Q♮,hx,y(t, u, v))x,y∈Λρ in the following way:
Q♮,hx,y(t, u, v) ≜ Qhx,y(t, u, v) if x ∈ Λρ−h; 0 if x ∈ Γρ.</p>
      <p>We assume that the purposes of the players in this game are the same as above, i.e., the first (respectively, second)
player wishes to minimize (respectively, maximize) the terminal payoff Eσ(Y(T)), where Y(t) denotes the state
of the corresponding Markov chain.</p>
      <sec id="sec-9-1">
        <title>For the Kolmogorov matrix Q♮,h(t, u, v) equation (4) takes the form:</title>
        <p>(d/dt)η♮,h(t, x) + min_{u∈U} max_{v∈V} ∑_{i=1}^{d} |fi(t, x, u, v)| (η♮,h(t, x + hχi(t, x, u, v)) − η♮,h(t, x))/h = 0, x ∈ Λρ−h,
(d/dt)η♮,h(t, x) = 0, x ∈ Γρ,
η♮,h(T, x) = σ(x).
(9)</p>
        <p>Note that system (9) is a finite system of ODEs. Thus, it can be solved using a numerical method.</p>
        <p>Let us estimate the difference between ηh(t, x) and η♮,h(t, x) for t ∈ [0, T], x ∈ Λr. This estimate and
Theorem 1 will lead to the approximation of the value function by the solution of the finite system of ODEs.</p>
        <p>If u(t, x) is a strategy of the first player and v(t, x, u) is a strategy of the second player, then denote the
corresponding Markov chain determined by the Kolmogorov matrix (Q♮,hx,y(t, u(t, x), v(t, x, u(t, x)))) and starting at (t∗, x∗)
by Y t∗,x∗,u,v. Further, denote by Pt∗,x∗,u,v and E t∗,x∗,u,v the corresponding probability and expectation,
respectively.</p>
        <p>Without loss of generality we can assume that P t∗,x∗,u,v and Pt∗,x∗,u,v are defined on the same measurable
space.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Analogously, let</title>
      <p>Let
Atρ∗,x∗,u,v = {Xt∗,x∗,u,v(t) ∈/ Λρ−h for some t ∈ [t∗, T]},
and, analogously,
Bρt∗,x∗,u,v = {Y t∗,x∗,u,v(t) ∈/ Λρ−h for some t ∈ [t∗, T]}.
We have that, for any event C such that C ∩ Atρ∗,x∗,u,v = C ∩ Bρt∗,x∗,u,v = ∅,
P t∗,x∗,u,v(C) = Pt∗,x∗,u,v(C).
(10)</p>
      <p>Lemma 1. For any t∗ ∈ [0, T], x∗ ∈ Λr, and any strategies of the players u and v,
P t∗,x∗,u,v(Atρ∗,x∗,u,v) ≤ d(r² + (h + M²)T²)e^T/ρ²,
(11)
Pt∗,x∗,u,v(Bρt∗,x∗,u,v) ≤ d(r² + (h + M²)T²)e^T/ρ².
(12)</p>
      <p>Proof. We shall prove estimate (11). To this end, given a controlled Markov chain with the Kolmogorov matrix
Q(t, u, v), let us introduce the generator acting on continuous functions by the following rule:
Lt[u, v]φ(x) = ∑_{y∈S} Qx,y(t, u, v)φ(y).
It follows from [Kolokoltsov, 2011] that if X(t) is a Markov chain corresponding to the Kolmogorov matrix Q
and E is the corresponding expectation, then
(d/dt)Eφ(X(t)) = ELt[u(t), v(t)]φ(X(t)).
(13)</p>
      <p>For the Markov chain with the Kolmogorov matrix Qh(t, u, v) the generator takes the form:
Lth[u, v]φ(x) = (1/h) ∑_{i=1}^{d} |fi(t, x, u, v)| [φ(x + hχi(t, x, u, v)) − φ(x)].
(14)</p>
      <p>Further, for x = (x1, . . . , xd), put ψi(x) ≜ |xi|². Equality (14) implies that, for any u, v,
Lt[u, v]ψi(x) = h|χi(t, x, u, v)|² + 2xiχi(t, x, u, v)|fi(t, x, u, v)|.</p>
    </sec>
    <sec id="sec-11">
      <title>Therefore, (13) yields the following equation:</title>
      <p>(d/dt)Et∗,x∗,u,v(Xit∗,x∗,u,v(t))² = hEt∗,x∗,u,v|χi(t, Xt∗,x∗,u,v(t), u, v)|²
+ Et∗,x∗,u,v 2Xit∗,x∗,u,v(t)χi(t, Xt∗,x∗,u,v(t), u, v)|fi(t, Xt∗,x∗,u,v(t), u, v)|.</p>
      <sec id="sec-11-1">
        <title>Here Xit∗,x∗,u,v(t) stands for the i-th coordinate of Xt∗,x∗,u,v(t).</title>
        <p>Since |χi(t, x, u, v)| ≤ 1, we have that
(d/dt)Et∗,x∗,u,v(Xit∗,x∗,u,v(t))² ≤ h + Et∗,x∗,u,v 2|fi(t, Xt∗,x∗,u,v(t), u, v)||Xit∗,x∗,u,v(t)|.</p>
        <sec id="sec-11-1-1">
          <title>Since fi is bounded by M we have that</title>
          <p>(d/dt)Et∗,x∗,u,v(Xit∗,x∗,u,v(t))² ≤ h + 2MEt∗,x∗,u,v|Xit∗,x∗,u,v(t)|.
Thus,
(d/dt)Et∗,x∗,u,v(Xit∗,x∗,u,v(t))² ≤ h + M² + Et∗,x∗,u,v(Xit∗,x∗,u,v(t))².</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Therefore,</title>
      <p>Et∗,x∗,u,v(Xit∗,x∗,u,v(t))² ≤ |x∗,i|² + (h + M²)(t − t∗) + ∫_{t∗}^{t} Et∗,x∗,u,v(Xit∗,x∗,u,v(τ))²dτ.</p>
    </sec>
    <sec id="sec-14">
      <title>Using Gronwall’s inequality, we get that</title>
      <p>Et∗,x∗,u,v(Xit∗,x∗,u,v(t))² ≤ (|x∗,i|² + (h + M²)(t − t∗))e^{t−t∗}.</p>
    </sec>
    <sec id="sec-13">
      <title>Therefore,</title>
      <p>Et∗,x∗,u,v sup_{t′∈[t∗,T]}(Xit∗,x∗,u,v(t′))² ≤ sup_{t′∈[t∗,T]} Et∗,x∗,u,v(Xit∗,x∗,u,v(t′))² ≤ (r² + (h + M²)T²)e^T.</p>
    </sec>
    <sec id="sec-15">
      <title>Using Markov inequality, we get</title>
      <p>P t∗,x∗,u,v(sup_{t∈[t∗,T]} |Xit∗,x∗,u,v(t)| ≥ ρ) ≤ (r² + (h + M²)T²)e^T/ρ².</p>
      <p>Hence,
P t∗,x∗,u,v(Atρ∗,x∗,u,v) ≤ ∑_{i=1}^{d} P t∗,x∗,u,v(sup_{t∈[t∗,T]} |Xit∗,x∗,u,v(t)| ≥ ρ) ≤ d(r² + (h + M²)T²)e^T/ρ².</p>
    </sec>
    <sec id="sec-16">
      <title>Inequality (12) is proved in the same way.</title>
      <p>Theorem 2. There exists a constant C2 such that for any h one can choose ρ satisfying
|Val(t0, x0) − η♮,h(t0, x0)| ≤ C2h.</p>
      <p>Proof. We shall estimate |ηh(t∗, x∗) − η♮,h(t∗, x∗)|.</p>
      <p>Further, let Āρt∗,x∗,u,v (respectively, B̄ρt∗,x∗,u,v) be the complement of Atρ∗,x∗,u,v (respectively, Bρt∗,x∗,u,v).
Using Lemma 1, we get
ηh(t∗, x∗) = min_u max_v Et∗,x∗,u,vσ(Xt∗,x∗,u,v(T))
≤ min_u max_v Et∗,x∗,u,vσ(Xt∗,x∗,u,v(T))1_{Āρt∗,x∗,u,v} + max_u max_v Et∗,x∗,u,vσ(Xt∗,x∗,u,v(T))1_{Aρt∗,x∗,u,v}
≤ min_u max_v E t∗,x∗,u,vσ(Y t∗,x∗,u,v(T))1_{B̄ρt∗,x∗,u,v} + Md(r² + (h + M²)T²)e^T/ρ²
≤ min_u max_v E t∗,x∗,u,vσ(Y t∗,x∗,u,v(T)) + 2Md(r² + (h + M²)T²)e^T/ρ²
= η♮,h(t∗, x∗) + 2Md(r² + (h + M²)T²)e^T/ρ².</p>
    </sec>
    <sec id="sec-17">
      <title>The opposite inequality is proved in the same way. Thus,</title>
      <p>|ηh(t∗, x∗) − η♮,h(t∗, x∗)| ≤ 2Md(r² + (h + M²)T²)e^T/ρ².</p>
      <p>One can choose ρ such that 2Md(r² + (h + M²)T²)e^T/ρ² ≤ h. Using Theorem 1, we get the conclusion of the</p>
    </sec>
    <sec id="sec-18">
      <title>Theorem.</title>
      <p>Acknowledgements</p>
      <p>This work was supported by the Russian Science Foundation (project no. 17-11-01093).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Averboukh</source>
          , 2016] Averboukh,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Approximate solutions of continuous-time stochastic games</article-title>
          .
          <source>SIAM J. Control Optim</source>
          .,
          <volume>54</volume>
          (
          <issue>5</issue>
          ):
          <fpage>2629</fpage>
          -
          <lpage>2649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Botkin et al., ]
          <string-name>
            <surname>Botkin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoffmann</surname>
          </string-name>
          , K.-H., and
          <string-name>
            <surname>Turova</surname>
          </string-name>
          , V. L.
          <article-title>Stable numerical schemes for solving Hamilton-Jacobi-Bellman-Isaacs equations</article-title>
          .
          <source>SIAM J. Control Optim</source>
          .,
          <volume>33</volume>
          (
          <issue>2</issue>
          ):
          <fpage>992</fpage>
          -
          <lpage>1007</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>[Boue &amp; Dupuis</source>
          , 1999] Boue,
          <string-name>
            <given-names>M.</given-names>
            and
            <surname>Dupuis</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Markov chain approximations for deterministic control problems with affine dynamics and quadratic cost in the control</article-title>
          .
          <source>Siam J. Numer Anal.</source>
          ,
          <volume>36</volume>
          :
          <fpage>667</fpage>
          -
          <lpage>695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Buslaeva</source>
          , 1977] Buslaeva,
          <string-name>
            <surname>L. T.</surname>
          </string-name>
          (
          <year>1977</year>
          ).
          <article-title>Stochastic control in a differential game</article-title>
          .
          <source>J. Appl. Math. Mech.</source>
          ,
          <volume>42</volume>
          (
          <issue>4</issue>
          ):
          <fpage>609</fpage>
          -
          <lpage>623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Camilli &amp; Marchi</source>
          , 2012] Camilli,
          <string-name>
            <given-names>F.</given-names>
            and
            <surname>Marchi</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Continuous dependence estimates and homogenization of quasi-monotone systems of fully nonlinear second order parabolic equations</article-title>
          .
          <source>Nonlinear Anal.</source>
          ,
          <volume>75</volume>
          (
          <issue>13</issue>
          ):
          <fpage>5103</fpage>
          -
          <lpage>5118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Elliott &amp; Kalton</source>
          , 1972] Elliott,
          <string-name>
            <given-names>R. J.</given-names>
            and
            <surname>Kalton</surname>
          </string-name>
          ,
          <string-name>
            <surname>N. J.</surname>
          </string-name>
          (
          <year>1972</year>
          ).
          <article-title>Values in differential games</article-title>
          .
          <source>Bull. Amer. Math. Soc.</source>
          ,
          <volume>78</volume>
          (
          <issue>3</issue>
          ):
          <fpage>427</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Jorgensen &amp; Zaccour</source>
          , 2006] Jorgensen,
          <string-name>
            <given-names>S.</given-names>
            and
            <surname>Zaccour</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>Developments in differential game theory and numerical methods: economic and management applications</article-title>
          .
          <source>Computational Management Science</source>
          ,
          <volume>4</volume>
          :
          <fpage>159</fpage>
          -
          <lpage>181</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>[Kolokoltsov</source>
          , 2011] Kolokoltsov,
          <string-name>
            <surname>V. N.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Markov processes, semigroups and generators</article-title>
          , volume
          <volume>38</volume>
          of De Gruyter Studies in Mathematics. De Gruyter.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>[Kolokoltsov</source>
          , 2012] Kolokoltsov,
          <string-name>
            <surname>V. N.</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Nonlinear Markov games on a finite state space (mean-field and binary interactions)</article-title>
          .
          <source>Int. J. Stat. Probab.</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>77</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>[Krasovskii &amp; Kotelnikova</source>
          , 2009] Krasovskii,
          <string-name>
            <given-names>N. N.</given-names>
            and
            <surname>Kotelnikova</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. N.</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Unification of differential games, generalized solutions of the hamilton-jacobi equations, and a stochastic guide</article-title>
          .
          <source>Differ</source>
          . Equ.,
          <volume>45</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1653</fpage>
          -
          <lpage>1668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>[Krasovskii &amp; Kotelnikova</source>
          , 2010a]
          <string-name>
            <surname>Krasovskii</surname>
            ,
            <given-names>N. N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kotelnikova</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          (
          <year>2010a</year>
          ).
          <article-title>An approach-evasion differential game: stochastic guide</article-title>
          .
          <source>Proc. Steklov Inst. Math., 269 Supplement</source>
          (1 Supplement):
          <fpage>191</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Krasovskii &amp; Kotelnikova</source>
          , 2010b]
          <string-name>
            <surname>Krasovskii</surname>
            ,
            <given-names>N. N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kotelnikova</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          (
          <year>2010b</year>
          ).
          <article-title>On a differential interception game</article-title>
          .
          <source>Proc. Steklov Inst. Math.</source>
          ,
          <volume>268</volume>
          (
          <issue>1</issue>
          ):
          <fpage>161</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>[Krasovskii &amp; Subbotin</source>
          , 1973] Krasovskii,
          <string-name>
            <given-names>N. N.</given-names>
            and
            <surname>Subbotin</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. I.</surname>
          </string-name>
          (
          <year>1973</year>
          ).
          <article-title>Approximation in a differential game</article-title>
          .
          <source>J. Appl. Math. Mech.</source>
          ,
          <volume>37</volume>
          (
          <issue>2</issue>
          ):
          <fpage>185</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>[Krasovskii &amp; Subbotin</source>
          , 1988] Krasovskii,
          <string-name>
            <given-names>N. N.</given-names>
            and
            <surname>Subbotin</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. I.</surname>
          </string-name>
          (
          <year>1988</year>
          ).
          <article-title>Game-theoretical control problems</article-title>
          . Springer, New York.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>[Kushner</source>
          , 2007] Kushner,
          <string-name>
            <surname>H.</surname>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Numerical approximations for stochastic differential games: The ergodic cost criterion</article-title>
          . In Jorgensen, S.,
          <string-name>
            <surname>Quincampoix</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vincent</surname>
          </string-name>
          , T. L., editors,
          <source>Advances in Dynamic Game Theory Numerical Methods</source>
          , Algorithms, and Applications to Ecology and Economics, pages
          <fpage>617</fpage>
          -
          <lpage>638</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>[Kushner &amp; Dupuis</source>
          , 2001] Kushner,
          <string-name>
            <surname>H.</surname>
          </string-name>
          and
          <string-name>
            <surname>Dupuis</surname>
            ,
            <given-names>P. G.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Numerical Methods for Stochastic Control Problems in Continuous Time</article-title>
          . Springer, New York.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Li &amp; Song</source>
          , 2007]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Markov chain approximation methods on generalized HJB equation</article-title>
          .
          <source>In Proceedings of the 46th IEEE Conference on Decision and Control</source>
          , pages
          <fpage>4069</fpage>
          -
          <lpage>4074</lpage>
          , New Orleans, LA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>[Subbotin</source>
          , 1995] Subbotin,
          <string-name>
            <surname>A. I.</surname>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Generalized solutions of first-order PDEs. The dynamical perspective</article-title>
          .
          <source>Birkhauser</source>
          , Boston.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>[Subbotin &amp; Chentsov</source>
          , 1981] Subbotin,
          <string-name>
            <given-names>A. I.</given-names>
            and
            <surname>Chentsov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G.</surname>
          </string-name>
          (
          <year>1981</year>
          ).
          <article-title>Optimization of guarantee in control problems</article-title>
          . Nauka, Moscow. in Russian.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>[Varaya &amp; Lin</source>
          , 1969] Varaya,
          <string-name>
            <given-names>P.</given-names>
            and
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          (
          <year>1969</year>
          ).
          <article-title>Existence of saddle points in differential games</article-title>
          .
          <source>Siam J. Control</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):
          <fpage>142</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>[Zachrisson</source>
          , 1964] Zachrisson,
          <string-name>
            <surname>L. E.</surname>
          </string-name>
          (
          <year>1964</year>
          ).
          <article-title>Markov games</article-title>
          . In Dresher,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Shapley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            , and
            <surname>Tucker</surname>
          </string-name>
          , A. W., editors,
          <source>Advances in game theory</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>253</lpage>
          . Princeton University Press, Princeton.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>