<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimal control of point-to-point navigation in turbulent flows using Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>M. Buzzicotti</string-name>
          <email>michele.buzzicotti@roma2.infn.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Biferale</string-name>
          <email>biferale@roma2.infn.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>F. Bonaccorso</string-name>
          <email>fabio.bonaccorso@roma2.infn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>P. Clark diLeoni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>K. Gustavsson</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Life Nano</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mechanical Engineering, Johns Hopkins University</institution>
          ,
          <addr-line>Baltimore, Maryland 21218</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Physics and INFN, University of Rome Tor Vergata</institution>
          ,
          <addr-line>Via della Ricerca Scientifica 1, 00133 Rome</addr-line>
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Dept. of Physics, University of Gothenburg</institution>
          ,
          <addr-line>Gothenburg, 41296</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present theoretical and numerical results concerning the problem of finding the path that minimizes the time needed to navigate between two given points in a complex fluid under realistic navigation constraints. We contrast deterministic Optimal Navigation (ON) control with stochastic policies obtained by Reinforcement Learning (RL) algorithms. We show that Actor-Critic RL algorithms are able to find quasi-optimal solutions in the presence of either time-independent or chaotically evolving flow configurations. For our application, ON solutions develop unstable behaviour within the typical duration of the navigation process and are therefore of little use in practice. The explored setup consists of using a constant propulsion speed to navigate a turbulent flow. Based on a discretized phase-space, the propulsion direction is adjusted with the aim of minimizing the time spent to reach the target. Our approach can be generalized to other setups, for example unmanned navigation with minimal energy consumption under imperfect environmental forecasts, or with different models for the moving vessel.</p>
      </abstract>
      <kwd-group>
        <kwd>Optimal Control</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Turbulence</kwd>
        <kwd>Unmanned Navigation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Controlling and planning paths of small autonomous marine vehicles [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], such as wave and current gliders [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], active drifters [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], buoyant underwater explorers, and small swimming
drones, is important for many geo-physical [4] and engineering [5] applications. In realistic
open environments, these vessels are affected by disturbances like wind, waves and ocean
currents, characterized by unpredictable (chaotic) trajectories. Furthermore, active control is
also limited by engineering and budget constraints, as in the important case of unmanned drifters
for oceanic exploration [6, 7]. The problem of (time) optimal point-to-point navigation in a
flow, known as Zermelo’s problem [8], is interesting per se in the framework of Optimal Control
Theory [9]. In this paper, we report the results of a recent theoretical and numerical study
[10], tackling Zermelo’s problem for navigation in a two-dimensional fully turbulent
flow in the presence of an inverse energy cascade, i.e. with chaotic, multi-scale and rough
velocity distributions [11]; see Fig. 1 for a summary of the problem. In such a flow, even for
time-independent configurations, trivial or naive navigation policies can be extremely inefficient
and ineffective if the set of actions available to the vessel is limited. We show that an approach based
on semi-supervised AI algorithms using actor-critic Reinforcement Learning (RL) [12] is able
to find robust quasi-optimal (stochastic) policies that accomplish the task. Furthermore, we
compare RL with solutions from Optimal Navigation (ON) theory [13] and show that the latter is
of almost no practical use for the case of navigation in turbulent waters, due to its strong sensitivity
to the initial (and final) conditions, in contrast to what happens for simpler advecting flows
[14]. RL has shown promising potential for similar problems, such as the training of
smart inertial particles or swimming objects navigating intense vortex regions [15, 16, 17].
      </p>
      <p>Figure 1. Left: sketch of the starting point, x_A, and of the ending point, x_B, of our problem,
together with an illustrative navigation trajectory. The flow is obtained from a spatially periodic
snapshot of a 2D turbulent configuration in the inverse energy cascade regime, with a multi-scale
power-law Fourier spectrum, E(k) = Σ_{k &lt; |k| &lt; k+1} |u(k)|² ∼ k^(−5/3). For the
RL optimization, the initial conditions are taken randomly inside a circle of radius d_A centered around
x_A. Similarly, the final target is the circle of radius d_B centered around x_B. The flow area is covered
by a grid-world of tiles s_i, with i = 1, …, N_s and N_s = 900, of size δ × δ, which identify the state-space
for the RL protocol. The large-scale periodicity of the underlying flow is L, and we fixed δ = L/10.
Every time interval Δt, the unmanned vessel selects one of the 8 possible actions a_j, with j = 1, …, 8 (the
steering directions θ_j depicted in the top left inset), according to a policy π(a|s), where π is the probability
distribution of the action a given the current state s of the agent at that time. The policy is optimized
during the learning to maximize the total reward, r_tot ∼ −T_{x_A → x_B}, proportional to minus the
navigation time, such that the maximum reward corresponds to the time-optimal trajectory. To reach
policy convergence, the actor-critic method requires accumulating experience over on the order of 1000
different trajectories, with small variations depending on the value of Ṽ_s and the specific flow properties.
Right: spatial concentration of trajectories for three values of Ṽ_s. The flow region is
color coded proportionally to the time the trajectories spend in each pixel area, for both ON (red) and RL
(blue). Light colors refer to low occupation and bright colors to high occupation. The green dashed line shows
the best ON trajectory out of the 20000 trajectories. Right histograms: arrival time distributions for ON (red) and RL
(blue). The probability of not reaching the target within the upper time limit is plotted in the Fail bar.</p>
      <p>We present here results from navigating one static snapshot of 2D turbulence (for
time-dependent flows see [10]). In Fig. 1 we show a sketch of the set-up. Our goal is to find (if they
exist) trajectories that join the region close to x_A with a target close to x_B in the shortest time,
supposing that the vessels obey the following equations of motion:

Ẋ_t = u(X_t, t) + U(X_t),    U(X_t) = V_s n(X_t),    (1)

where u(X_t, t) is the velocity of the underlying 2D advecting flow, and U(X_t) = V_s n(X_t) is
the control slip velocity of the vessel, with fixed intensity V_s and varying steering direction
n(X_t) = (cos[θ_t], sin[θ_t]), where the angle is evaluated along the trajectory, θ_t = θ(X_t). We
introduce a dimensionless slip velocity by normalizing with the maximum velocity U_max of the
underlying flow: Ṽ_s = V_s/U_max. Zermelo’s problem reduces to optimizing the steering direction
in order to reach the target [8]. For time-independent flows, optimal navigation (ON) control
theory gives a general solution [18, 19]. Assuming that the angle is controlled continuously in
time, the optimal steering angle must satisfy the following time evolution:

θ̇_t = A_21 sin²θ_t − A_12 cos²θ_t + (A_11 − A_22) cos θ_t sin θ_t,    (2)

where A_ij = ∂_j u_i(X_t) is evaluated along the agent trajectory obtained from Eq. (1). The set of
equations (1-2) may lead to chaotic dynamics even for time-independent flows in two spatial
dimensions. Due to the sensitivity to small perturbations in chaotic systems, the ON approach
becomes useless for many practical applications.</p>
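      <p>As a minimal numerical sketch of Eqs. (1-2), the code below integrates the position and the ON steering angle with an explicit Euler step. A steady cellular (Taylor-Green-like) model flow stands in for the turbulent snapshot used in the paper; the flow choice, the parameter values and the integration scheme are illustrative assumptions, not the paper's setup.</p>

```python
import numpy as np

def u_flow(x, y):
    # Model velocity field u = (u1, u2) and its gradient matrix A_ij = d_j u_i
    # (an illustrative cellular flow, NOT the turbulent snapshot of the paper).
    u1 = -np.cos(x) * np.sin(y)
    u2 = np.sin(x) * np.cos(y)
    A = np.array([[np.sin(x) * np.sin(y), -np.cos(x) * np.cos(y)],
                  [np.cos(x) * np.cos(y), -np.sin(x) * np.sin(y)]])
    return np.array([u1, u2]), A

def on_step(pos, theta, vs, dt):
    """One Euler step of Eq. (1) for the position and Eq. (2) for the angle."""
    u, A = u_flow(*pos)
    n = np.array([np.cos(theta), np.sin(theta)])
    pos_new = pos + dt * (u + vs * n)                      # Eq. (1)
    theta_dot = (A[1, 0] * np.sin(theta)**2                # A_21 sin^2(theta)
                 - A[0, 1] * np.cos(theta)**2              # - A_12 cos^2(theta)
                 + (A[0, 0] - A[1, 1]) * np.cos(theta) * np.sin(theta))
    return pos_new, theta + dt * theta_dot                 # Eq. (2)

# Integrate one trajectory from an arbitrary initial condition.
pos, theta = np.array([0.1, 0.2]), 0.5
for _ in range(1000):
    pos, theta = on_step(pos, theta, vs=0.8, dt=1e-3)
```

      <p>The sensitivity discussed above can be probed by integrating two such trajectories from slightly perturbed initial angles and monitoring their separation.</p>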
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <p>
        RL applications [12] are based on the idea that an optimal solution can be obtained by learning
from continuous interactions of an agent with its environment. The agent interacts with the
environment by sampling its states s, performing actions a and collecting rewards r. In our case
the vessel acts as the agent and the two-dimensional flow as the environment. In the approach
used here, actions are chosen randomly with a probability given by the policy π(a|s),
given the current flow-state s. The goal is to find the optimal policy π*(a|s) that maximizes the
total reward, r_tot = Σ_t r_t, accumulated along one episode; see Fig. 1 for the precise definition of
flow-states and agent-actions. To identify a time-optimal trajectory we use a potential-based
reward shaping [20] at each time during the learning process; see [10] for details. An episode
is finalized when the trajectory reaches the circle of radius d_B around the target. In order to
converge to robust policies, each episode is started from a uniformly random position within a
given radius, d_A, from the starting point. To estimate the expected total future reward we follow
the one-step actor-critic method [12], based on gradient ascent in the policy parametrization.
      </p>
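      <p>The one-step actor-critic loop can be sketched on a toy grid-world: a tabular softmax policy over 8 actions is updated by gradient ascent with a state-value critic, and a constant drift stands in for the flow. The grid size, learning rates, drift, and the plain reward of −1 per step until the target (a simplification of the potential-based shaping used in the paper) are all illustrative assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10                       # N x N grid-world of tiles (states), assumption
GOAL = (9, 9)
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1),
           (1, 1), (1, -1), (-1, 1), (-1, -1)]   # 8 steering directions

theta = np.zeros((N, N, 8))  # actor: softmax preferences per tile
V = np.zeros((N, N))         # critic: state-value estimates

def policy(s):
    # Softmax over the 8 action preferences of tile s.
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

def step(s, a):
    # Vessel displacement plus a constant "current" pushing +x (assumption).
    dx, dy = ACTIONS[a]
    nx = min(N - 1, max(0, s[0] + dx + 1))
    ny = min(N - 1, max(0, s[1] + dy))
    return (nx, ny)

alpha_v, alpha_p, gamma = 0.1, 0.1, 0.99
for episode in range(500):
    s = (rng.integers(0, 3), rng.integers(0, 3))  # random start near origin
    for t in range(200):
        p = policy(s)
        a = rng.choice(8, p=p)
        s2 = step(s, a)
        r = 0.0 if s2 == GOAL else -1.0           # reward ~ minus elapsed time
        delta = r + (0.0 if s2 == GOAL else gamma * V[s2]) - V[s]
        V[s] += alpha_v * delta                   # critic: TD(0) update
        grad = -p
        grad[a] += 1.0                            # d log pi / d preferences
        theta[s] += alpha_p * delta * grad        # actor: policy-gradient step
        s = s2
        if s == GOAL:
            break
```

      <p>After training, the learned (stochastic) policy at each tile is read off from policy(s); a greedy roll-out follows the action with the largest preference.</p>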
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        In the right part of Fig. 1 we show the main results comparing the RL and ON approaches [10].
The minimum time taken by the best trajectory to reach the target is of the same order for
the two methods. The most important difference between RL and ON lies in their robustness,
as seen by plotting the spatial density of trajectories in the right part of Fig. 1 for the optimal
policies of ON and RL with three values of Ṽ_s. We observe that the RL trajectories (blue coloured
area) form a much more coherent cloud in space, while the ON trajectories (red coloured area)
fill space almost uniformly. Moreover, for small navigation velocities, many trajectories in
the ON system approach regular attractors, as visible from the high-concentration regions. The
rightmost histograms in Fig. 1 show a comparison between the probability of arrival times for
the trajectories illustrated in the two-dimensional domain, providing a quantitative estimation
of the better robustness of RL compared to ON. Other RL algorithms, such as Q-learning [12],
could also be implemented and compared with other path-search algorithms, such as A*, which
is often used in many fields of computer science [21, 22].
      </p>
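      <p>For reference, a minimal A* search on a discrete grid-world (8-connected moves, Chebyshev-distance heuristic) is sketched below; the obstacle layout and the unit move costs are illustrative assumptions and do not reproduce the flow-dependent travel times of the navigation problem.</p>

```python
import heapq

def astar(start, goal, blocked, n):
    """Minimal A* on an n x n grid with 8-connected unit-cost moves."""
    def h(p):
        # Chebyshev distance: admissible and consistent for unit diagonals.
        return max(abs(p[0] - goal[0]), abs(p[1] - goal[1]))
    openq = [(h(start), 0, start, None)]   # (f, g, node, parent)
    came, cost = {}, {start: 0}
    while openq:
        _, g, cur, parent = heapq.heappop(openq)
        if cur in came:
            continue                        # already expanded
        came[cur] = parent
        if cur == goal:
            path = []                       # walk parents back to the start
            while cur is not None:
                path.append(cur)
                cur = came[cur]
            return path[::-1]
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nxt = (cur[0] + dx, cur[1] + dy)
                if nxt == cur or nxt in blocked:
                    continue
                if min(nxt) >= 0 and n - 1 >= max(nxt):
                    g2 = g + 1
                    if cost.get(nxt, 10**9) > g2:
                        cost[nxt] = g2
                        heapq.heappush(openq, (g2 + h(nxt), g2, nxt, cur))
    return None                             # goal unreachable

# Toy obstacle: a wall on column x = 5 with a single gap at y = 9.
path = astar((0, 0), (9, 9), blocked={(5, y) for y in range(9)}, n=10)
```
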
      <p>To conclude, we have discussed a systematic investigation of Zermelo’s time-optimal navigation
problem in a realistic 2D turbulent flow, comparing both RL and ON approaches [10]. We
showed that RL stochastic algorithms are key to bypassing the unavoidable instability given by the
chaoticity of the environment and/or by the strong sensitivity of ON approaches in the presence
of non-linear flow configurations. RL methods also offer a wider flexibility, being applicable
to energy-minimization problems and to situations where the flow evolution is known only
in a statistical sense, as in partially observable Markov processes. Let us stress that it is possible
to implement RL strategies aimed at improving a-priori policies designed for particular problems,
instead of starting from a completely random policy. For example, one can imagine using an RL
approach to optimize an initial “trivial” policy, where the navigation angle is selected as the
action that points most directly toward the target.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>L.B. and M.B. acknowledge funding from the European Union Programme (FP7/2007-2013)
AdG ERC grant No. 339032. K.G. acknowledges funding from the Knut and Alice Wallenberg
Foundation, Grant No. KAW 2014.0048, and Vetenskapsrådet, Grant No. 2018-03974. F.B.
acknowledges funding from the European Research Council under the European Union’s
Horizon 2020 Framework Programme (No. FP/2014-2020) ERC Grant Agreement No. 739964
(COPMAT).</p>
      <p>[4] P. F. Lermusiaux, D. Subramani, J. Lin, C. Kulkarni, A. Gupta, A. Dutt, T. Lolla, P. Haley,
W. Ali, C. Mirabito, et al., A future for intelligent autonomous ocean observing systems,
Journal of Marine Research 75 (2017) 765–813.</p>
      <p>[5] C. Bechinger, R. Di Leonardo, H. Löwen, C. Reichhardt, G. Volpe, G. Volpe, Active particles
in complex and crowded environments, Reviews of Modern Physics 88 (2016) 045006.</p>
      <p>[6] L. R. Centurioni, Drifter technology and impacts for sea surface temperature, sea-level
pressure, and ocean circulation studies, in: Observing the Oceans in Real Time, Springer,
2018, pp. 37–57.</p>
      <p>[7] D. Roemmich, G. C. Johnson, S. Riser, R. Davis, J. Gilson, W. B. Owens, S. L. Garzoli,
C. Schmid, M. Ignaszewski, The Argo program: Observing the global ocean with profiling
floats, Oceanography 22 (2009) 34–43.</p>
      <p>[8] E. Zermelo, Über das Navigationsproblem bei ruhender oder veränderlicher Windverteilung,
ZAMM-Journal of Applied Mathematics and Mechanics/Zeitschrift für Angewandte
Mathematik und Mechanik 11 (1931) 114–124.</p>
      <p>[9] A. E. Bryson, Y. Ho, Applied optimal control: optimization, estimation and control, New
York: Routledge, 1975.</p>
      <p>[10] L. Biferale, F. Bonaccorso, M. Buzzicotti, P. Clark Di Leoni, K. Gustavsson, Zermelo’s
problem: Optimal point-to-point navigation in 2D turbulent flows using reinforcement
learning, Chaos: An Interdisciplinary Journal of Nonlinear Science 29 (2019) 103138.</p>
      <p>[11] A. Alexakis, L. Biferale, Cascades and transitions in turbulent flows, Physics Reports
767-769 (2018) 1–101.</p>
      <p>[12] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT Press, 2018.</p>
      <p>[13] L. S. Pontryagin, Mathematical theory of optimal processes, Routledge, 2018.</p>
      <p>[14] E. Schneider, H. Stark, Optimal steering of a smart active particle, arXiv preprint
arXiv:1909.03243 (2019).</p>
      <p>[15] S. Colabrese, K. Gustavsson, A. Celani, L. Biferale, Smart inertial particles, Physical Review
Fluids 3 (2018) 084301.</p>
      <p>[16] S. Colabrese, K. Gustavsson, A. Celani, L. Biferale, Flow navigation by smart
microswimmers via reinforcement learning, Physical Review Letters 118 (2017) 158004.</p>
      <p>[17] K. Gustavsson, L. Biferale, A. Celani, S. Colabrese, Finding efficient swimming strategies
in a three-dimensional chaotic flow by reinforcement learning, The European Physical
Journal E 40 (2017) 110.</p>
      <p>[18] L. Techy, Optimal navigation in planar time-varying flow: Zermelo’s problem revisited,
Intelligent Service Robotics 4 (2011) 271–283.</p>
      <p>[19] G. Mannarini, N. Pinardi, G. Coppini, P. Oddo, A. Iafrati, VISIR-I: small vessels, least-time
nautical routes using wave forecasts, Geoscientific Model Development 9 (2016) 1597–1625.</p>
      <p>[20] A. Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations:
Theory and application to reward shaping, ICML 99 (1999) 278.</p>
      <p>[21] S. Russell, P. Norvig, Artificial intelligence: a modern approach (2002).</p>
      <p>[22] J. Lerner, D. Wagner, K. Zweig, Algorithmics of large and complex networks: design,
analysis, and simulation, volume 5515, Springer, 2009.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Petres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pailhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Patron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Petillot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lane</surname>
          </string-name>
          ,
          <article-title>Path planning for autonomous underwater vehicles</article-title>
          ,
          <source>IEEE Transactions on Robotics</source>
          <volume>23</volume>
          (
          <year>2007</year>
          )
          <fpage>331</fpage>
          -
          <lpage>341</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <article-title>Wave glider dynamic modeling, parameter identification and simulation</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Hawaii at Manoa, Honolulu,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lumpkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pazos</surname>
          </string-name>
          ,
          <article-title>Measuring surface currents with surface velocity program drifters: the instrument, its data, and some recent results</article-title>
          ,
          <source>Lagrangian analysis and prediction of coastal and ocean dynamics</source>
          (
          <year>2007</year>
          )
          <fpage>39</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>