<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Autonomous drone interception with Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Bertoin</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrien Gaufriau</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damien Grasset</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jayant Sen Gupta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Airbus AI Research</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Airbus Operations</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>IRT Saint-Exupery Montreal</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>IRT Saint-Exupery</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>ISAE-SUPAERO</institution>
          ,
          <addr-line>Toulouse</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Driven by recent successes in artificial intelligence, new autonomous navigation systems are emerging in the urban space. The adoption of such systems raises questions about certification criteria and their vulnerability to external threats. This work focuses on the automated anti-collision systems designed for autonomous drones operating in an urban context, which is less controlled than conventional airspace and more vulnerable to potential intruders. In particular, we highlight the vulnerabilities of such systems to hijacking, taking as example the scenario of an autonomous delivery drone diverted from its mission by a malicious agent. We demonstrate the possibility of training Reinforcement Learning agents to deflect a drone equipped with an automated anti-collision system. Our contribution is threefold. Firstly, we illustrate the security vulnerabilities of these systems. Secondly, we demonstrate the effectiveness of Reinforcement Learning for the automatic detection of security flaws. Thirdly, we provide the community with an original benchmark based on an industrial use case.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated anti-collision system</kwd>
        <kwd>Autonomous Drone</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Urban Mobility</kwd>
        <kwd>UAV</kwd>
        <kwd>Security</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Spurred on by recent advances in Machine Learning, autonomous vehicles are destined to become an integral part of tomorrow’s mobility. So far limited to land conveyance, a new category of autonomous vehicles (e.g., air cabs, delivery drones) is about to take its place in the urban mobility space. Much like self-driving cars, unmanned aerial vehicles (UAVs) raise new public acceptance and certification challenges. Guaranteeing the non-collision of these UAVs with the obstacles present in the urban landscape and with other flying objects is part of these challenges. While flight plans and airways are used to prevent collisions in traditional airspace, they most often require human supervision. The potential number of UAVs circulating simultaneously in urban air mobility contexts makes such an approach unfeasible. Thus, many UAVs are currently equipped with automatic collision avoidance systems. These systems are subject to drastic controls ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) to obtain the certifications allowing them to be integrated into UAVs. Although these certifications guarantee the systems’ proper functioning and efficiency, the deterministic aspect of their decisions makes their behavior predictable. We argue that this predictability exposes ownships equipped with such systems to malicious attacks. In this paper, we take advantage of Reinforcement Learning (RL) agents’ exploration and exploitation capabilities to demonstrate the possibility of diverting an autonomous drone to an interception zone by hacking its obstacle avoidance system. We implement a simulator based on the scenario of an autonomous delivery drone trying to reach a delivery area while being hijacked by another drone. We use RL to train an autonomous attacker to position itself so as to deceive the anti-collision system and thereby deflect its target’s trajectory towards the desired zone. Our work thus highlights a critical security flaw of UAVs and opens the way to the use of RL in the certification procedures of such systems.
      </p>
      <p>This paper is organized as follows. Section 2 presents the related works. Section 3 introduces
background knowledge on the ACAS-X anti-collision system family and Reinforcement Learning.
Section 4 presents the general interception scenario and the simulator design. Section 5 shows
empirical results and discusses the critical components of successful hijackings. Finally, Section
6 summarizes and concludes this paper on future perspectives.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Security in avionics. Ongoing research into security vulnerabilities plays a fundamental role in the fight against hijacking. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors exploit vulnerabilities of the Aircraft Communication Addressing and Reporting System (ACARS) network to upload new flight plans into the aircraft’s Flight Management System (FMS). Nevertheless, since the pilots still carry out the control of the aircraft, the attack mainly leads to an increase in their workload. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] directly targets the ground infrastructure and develops two different radio-signal attacks on the Instrument Landing System (ILS) that prevent the aircraft from landing successfully. The authors demonstrate a consistent success rate, with offset touchdowns from 18 to over 50 meters laterally and longitudinally. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] highlights the vulnerabilities of Avionics Wireless Networks (AWN). [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] describes a detailed, plausible attack effectively reaching the avionics network of a commercial airplane. Due to the complexity and cost of deploying such attacks, most of these results remain theoretical. To date, none of these attacks enable complete control of the aircraft, and their impact is often mitigated by the presence of humans in the loop. In the future, the introduction of low-complexity autonomous systems, combined with the popularization of drones and their accessibility, may change this paradigm. Our work focuses on this new generation of UAVs, which is about to occupy the urban airspace, and highlights their vulnerabilities to malicious attacks.
      </p>
      <p>
        Automatic anti-collision systems. Although supplemented by pilots’ interventions, automated collision detection and avoidance plays a crucial role in the safety of flying vehicles. These systems are mainly used in airliners to provide maneuvering recommendations to the pilot and thus lighten his general workload. The most commonly used is the Traffic Collision Avoidance System (TCAS). Unlike newer systems, the TCAS requires both aircraft to be equipped in order to be effective. Based on [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the Airborne Collision Avoidance System X (ACAS X) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] has been designed for autonomous vehicles. The ACAS systems rely upon probabilistic models to represent the various sources of uncertainty and upon computer-based optimization to provide the best possible collision avoidance recommendations. Their robustness has been demonstrated notably in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] using a Petri net model and in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] using formal methods. Within the ACAS X family, the ACAS-Xu is specifically designed for unmanned aircraft systems (UAS). Due to the reduced capabilities of UAS, this variant solves conflicts using horizontal motions. However, the cost table size (∼5 GB) makes it challenging to embed within small UAVs. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] study the possibility of replacing these look-up tables with surrogate models to reduce the system’s footprint. Nonetheless, the use of surrogates raises the problem of their certification. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] employ formal methods to provide guarantees on these models. Our contribution highlights security flaws of the ACAS-Xu but could as well be extended to other ACAS-X versions.
      </p>
      <p>
        RL for Unmanned Aerial Vehicles. Recent successes in Reinforcement Learning have demonstrated the ability of autonomous agents to outperform humans in many tasks [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22 ref23">19, 20, 21, 22, 23</xref>
        ], and several works have applied it to the control and navigation of UAVs. RL provides a promising alternative to classical stability and control methods such as Proportional-Integral-Derivative (PID) control systems [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. While PID control has demonstrated excellent results in stable environments, it is less effective in unpredictable and harsh environments. Recently, several research projects have explored the possibility of using Reinforcement Learning to address its limitations. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] compares the efficiency of a model-based Reinforcement Learning controller with Integral Sliding Mode (ISM) control [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. The authors of [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] train a neural network policy for quadrotor controllers using an original policy optimization algorithm with Monte-Carlo estimates. The learned policy manages to stabilize the quadrotor in the air even under very harsh initializations, both in simulation and on a real quadrotor. [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] trains autonomous flight control systems with state-of-the-art model-free deep Reinforcement Learning algorithms (Deep Deterministic Policy Gradient [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], Trust Region Policy Optimization [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], Proximal Policy Optimization [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]) and compares their performance with PID controllers. In [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], a sequential latent variable model is learned from flying sequences of an actual drone controlled with PID. This latent dynamic model is used as a generative model to learn a deep model-based RL agent directly on real drones with a limited number of steps.
      </p>
      <p>
        Mission planning is another classical use case of RL for UAVs. In [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], the authors combine
a Q-learning [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ] algorithm focusing on navigation policy with PID controllers. In [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], the
navigation problem is decomposed into two simpler sub-tasks (collision-avoidance and
approaching the target), each of them solved by a separate neural network in a distributed deep
RL framework. An active field of research focuses on interception and defense against malicious
drones. In a 1 vs 1 close combat situation, [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] demonstrates the effectiveness of an A3C [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] RL
agent versus an opponent with Greedy Shooter policy [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. In a multi-agent context, [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] uses a
Multi-Agent Deep Deterministic Policy Gradient algorithm (MADDPG) [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] in an attack-defense
confrontation Markov game. [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] proposes a ground defense system trained with Q-learning
to choose between high-level defense strategies (GPS spoofing, jamming, hacking, and laser
shooting). While [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] and [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ] use RL to train a drone attacker to intercept a target drone, [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ]
places the agent in the defender’s position and trains it with a Soft Actor-Critic algorithm [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ]
to avoid capture. [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] proposes an adaptive stress-testing method based on Monte Carlo Tree Search [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ]
to find the most likely state trajectories leading to near mid-air collisions.
      </p>
      <p>Our contribution also aims at intercepting a target UAV using an RL agent. However, it differs on two significant points. First, it highlights the security flaws of a deterministic policy dictated by the ACAS-Xu system for collision avoidance. Second, our attacker does not seek to capture the target directly but to guide it to a specific area where it can potentially be captured. This strategy does not require any attack equipment directly implemented on the attacking UAV and can easily be applied to any UAV.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of the ACAS system</title>
        <p>The Airborne Collision Avoidance System X (ACAS X) is a collection of rules providing an optimized decision logic for airborne collision avoidance. Within the ACAS X family, the ACAS-Xu is specifically designed for drones and urban air mobility. By requesting a set of lookup tables (LUT) computed offline, the ACAS-Xu provides, in real time, a horizontal resolution of conflicts. Using data collected from its surrounding environment, the ownship queries the LUT for the collision probabilities of five different directional recommendations: COC (Clear of Conflict for the current heading), WL (Weak Left), WR (Weak Right), L (Left), and R (Right). The ACAS-Xu is thereby used to avoid any static object (tower, crane, antenna) or any moving object (e.g., birds, UAVs) whose behavior could be unpredictable or even governed by the same avoidance system. No particular maneuver is required when the ownship is in the COC state. Otherwise, the autonomous agent may decide to follow the advisory that minimizes the probability of conflict according to the geometric parameters. We describe these parameters, along with their units of measure, in Table 1, and illustrate in Figure 1 the example of an intruder crossing the ownship’s trajectory.</p>
        <p>
          A complete description of the ACAS-Xu system can be found in [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Reinforcement Learning</title>
        <p>Within the field of Machine Learning, Reinforcement Learning is particularly suited for sequential decision-making problems. In RL, an agent interacts with its environment during a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent is provided a representation of the environment state s_t ∈ S, where S is the space of all possible states. Considering s_t, the agent takes an action a_t ∈ A, where A is the space of all possible actions (including inaction). Performing this action in the environment causes the environment to transition from s_t to s_{t+1}, and as a consequence of this transition, the agent receives a numerical reward r_t ∈ ℝ.</p>
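        <p>This interaction loop can be summarized in a few lines of code. The sketch below uses the Gymnasium API and a random policy as concrete stand-ins; the specific environment is for illustration only.</p>
        <preformat>
# The agent-environment loop in code, using the Gymnasium API and a
# random policy as stand-ins (the environment is for illustration only).
import gymnasium as gym

env = gym.make("CartPole-v1")
s_t, _ = env.reset(seed=0)
for t in range(200):
    a_t = env.action_space.sample()                         # a_t in A
    s_next, r_t, terminated, truncated, _ = env.step(a_t)   # s_t -> s_{t+1}, r_t
    s_t = s_next
    if terminated or truncated:
        s_t, _ = env.reset()
env.close()
        </preformat>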
        <p>[Figure 2: The agent-environment interaction loop in Reinforcement Learning: the agent sends an action a_t to the environment, which returns the next state s_{t+1} and the reward r_{t+1}.]</p>
        <p>
          The agent’s behavior is defined by a policy π_θ with parameters θ, and the objective J(θ) denotes the expected return obtained when following π_θ. The policy gradient theorem expresses the gradient of this objective as
          ∇_θ J(θ) = E_{s∼ρ^π, a∼π_θ} [∇_θ log π_θ(a|s) Q^π(s, a)],
          where Q^π(s, a) = E_{s∼ρ^π, a∼π_θ} [R_t | s, a] is the action-value function. Q^π(s, a) represents the expected return of performing action a in state s and following π afterwards. Policy gradient methods typically require an estimate of Q^π(s, a). An approach used in actor-critic methods consists in using a parameterized estimator, called the critic, to estimate Q^π(s, a) (π_θ thus represents the actor part of the agent). By relying on this principle, [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] proposes the deterministic policy gradient algorithm to compute ∇_θ J(θ) for a deterministic policy μ_θ:
          ∇_θ J(θ) = E_{s∼ρ^μ} [∇_a Q^μ(s, a)|_{a=μ_θ(s)} ∇_θ μ_θ(s)].
        </p>
        <p>
          The Deep Deterministic Policy Gradient (DDPG) algorithm [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] adapts the ideas underlying the success of Deep Q-Learning [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ] [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] to estimate Q with a neural network with parameters φ. In DDPG, the learned Q-function tends to overestimate Q(s, a), thus leading to the policy exploiting the Q-function’s estimation errors. Inspired by Double Q-learning [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ], the Twin Delayed DDPG (TD3) [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ] addresses this overestimation by taking the minimum estimate between a pair of critics and adding noise to the actions used to form the Q-learning target. Combined with a less frequent policy update (one update every d critic updates), these tricks result in substantially improved performance over DDPG on a number of challenging tasks in the continuous control setting. Algorithm 1 describes TD3’s training procedure.
        </p>
        <p>[Algorithm 1: TD3 training procedure.]</p>
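        <p>As an illustration of the mechanism described above, the following PyTorch sketch computes a TD3-style critic target using clipped double-Q estimation and target-policy smoothing. The network objects, tensor shapes, and coefficient values are illustrative assumptions, not the exact implementation used in this work.</p>
        <preformat>
# Sketch of the TD3 critic target: clipped double-Q plus
# target-policy smoothing noise (hyper-parameter values assumed).
import torch

def td3_target(r, s_next, done, actor_tgt, q1_tgt, q2_tgt,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """r, done: (batch, 1) float tensors; actor_tgt, q*_tgt: target networks."""
    with torch.no_grad():
        a_det = actor_tgt(s_next)
        # Smoothing noise, clipped to keep the target action plausible.
        noise = (torch.randn_like(a_det) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_det + noise).clamp(-max_action, max_action)
        # Minimum over the two critics limits Q-value overestimation.
        q_next = torch.min(q1_tgt(s_next, a_next), q2_tgt(s_next, a_next))
        return r + gamma * (1.0 - done) * q_next
        </preformat>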
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Drone interception scenario</title>
      <p>We consider an environment composed of two agents (T, A) and two areas of interest (Z1, Z2) inside a delimited playground. The first agent, T, which we refer to as the target, is a delivery drone whose mission is to reach the delivery area Z1. This target drone is equipped with the ACAS-Xu system introduced in Section 3.1. The target starts in position P_T, has a fixed velocity V_T, and uses the autonomous avoidance system to modify its trajectory when a possible conflict is detected. The second agent, A, referred to as the attacker, aims at hijacking the target towards an alternate delivery area, called the interception area Z2. To do so, the attacker exploits the potential flaws induced by the target’s use of the ACAS-Xu avoidance system to deflect its trajectory toward the interception area Z2. The attacker starts in position P_A and can continuously adjust its velocity V_A within the range [0, V_A^max].</p>
      <p>Both agents interact in a 2-dimensional playground. Gravity is not considered, and the scene
does not contain any obstacles. Figure 3 provides a schematic representation of the set-up.</p>
      <p>[Figure 3: Schematic representation of the set-up, showing the attacker A, the target T and its heading, the ACAS advisory, the delivery area Z1, and the interception area Z2.]</p>
        <p>One can note that the Xu version of the ACAS system is dedicated to drones and only provides horizontal avoidance recommendations. Among the other versions of the ACAS, the Xa version is dedicated to large aircraft and provides vertical avoidance. For the sake of simplicity, we restrict the interception to a horizontal plane and to the sole use of the ACAS-Xu avoidance system. Adding a vertical dimension would not necessarily complicate the learning task (given some adjustments to the reward model). We leave the study of a 3D interception with a target equipped with both the Xu and the Xa versions to future work.</p>
        <sec id="sec-4-2-1">
          <title>The simulator environment</title>
          <p>We implemented this scenario in an original open-source simulator<sup>1</sup>. In the following, we describe its main characteristics.</p>
          <p><sup>1</sup> Since the ACAS-Xu lookup tables are exclusively available to organizations that are part of RTCA/EUROCAE (the authority responsible for the standardization of the ACAS-Xu), a mode allowing the simulator to run with a surrogate model emulating the tables is available. Link available by the time of the conference.</p>
          <p>State space: The state of the environment at step t is fully described by the states of the two agents (target and attacker) as well as the cartesian positions of the delivery and interception areas. Each agent’s state is composed of its cartesian position p_t = (x_t, y_t) and its velocity vector v⃗_t in the horizontal plane. We denote by θ_t (resp. V_t) the heading angle (resp. the norm) of the velocity vector, and use v⃗_t = (θ_t, V_t) as its representation. The last recommendation a_{t−1} of the ACAS-Xu system is also included in the state.</p>
          <p>Action space: The attacker’s action vector at step t is composed of two updates, ΔV_t^A ∈ [−200, +200] (ft/s) and Δθ_t^A ∈ [−π/2, +π/2] (rad), representing respectively an update of its velocity and of its heading.</p>
          <p>Transitions: The target agent’s velocity is constant during the whole episode. Its heading is updated according to the ACAS-Xu recommendations. If the advisory provided is different from COC, the following heading update Δθ_t^T is performed: WL → +0.15 (rad), WR → −0.15 (rad), L → +0.3 (rad), and R → −0.3 (rad). If the advisory is COC, the θ update allows the target to reach the optimal heading pointing to the delivery area, with a maximum variation of 0.3 (rad). This variation is constrained to represent actual drone maneuverability and to avoid instability due to big turns. Both agents’ positions are updated by p_{t+1} = p_t + v⃗_{t+1}, with the update of the speed vector given by ‖v_{t+1}‖ = ‖v_t‖ + ΔV_t and θ_{t+1} = θ_t + Δθ_t.</p>
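          <p>A minimal sketch of the target’s transition step is given below, assuming a unit time step and the heading updates listed above; the function and variable names are ours, not the simulator’s API.</p>
          <preformat>
# Minimal sketch of the target's transition step (unit time step
# assumed; names are ours, not the simulator's API).
import math

HEADING_UPDATE = {"WL": +0.15, "WR": -0.15, "L": +0.3, "R": -0.3}  # rad

def angle_diff(a, b):
    """Signed difference a - b, wrapped to (-pi, pi]."""
    return (a - b + math.pi) % (2 * math.pi) - math.pi

def target_step(x, y, theta, v, advisory, theta_goal, max_turn=0.3):
    if advisory == "COC":
        # Turn towards the delivery area, limited to +/- max_turn rad.
        dtheta = max(-max_turn, min(max_turn, angle_diff(theta_goal, theta)))
    else:
        dtheta = HEADING_UPDATE[advisory]
    theta = theta + dtheta
    # p_{t+1} = p_t + v_{t+1}: move along the updated heading.
    return x + v * math.cos(theta), y + v * math.sin(theta), theta
          </preformat>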
          <p>Reward model: RL agents are strongly impacted by the reward model used during training. The environment implements the following reward schema:
          r_t = r_{t−1} + (d2_min − d2) if d1 ≥ d1_min and d2 &lt; d2_min, and r_t = 0 otherwise,
          with d1 the distance between the target and the delivery zone, d2 the distance between the target and the interception zone, and d1_min and d2_min their respective minima since the episode’s beginning. The agent is rewarded when d2 is reduced. To avoid the degenerate case where the attacker makes the target move away and come closer continuously, the value of the reward at each step is increased as long as the target gets closer to the interception zone; it is reset to 0 otherwise. This choice of reward model encourages the agent to divert the target along the most direct trajectory possible. Figure 4 shows two visualizations of the simulator.</p>
          <p>[Figure 4: Two visualizations of the simulator: (a) attacker’s point of view; (b) top view.]</p>
        </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental results and discussion</title>
      <p>
        This section exhibits the efficiency of RL agents in hijacking the target delivery drone and discusses the different factors impacting the success of an interception. For the sake of simplicity, we fixed the target’s velocity to 400 ft/s. We trained three attacker agents, A300, A600, and A1000, capable of reaching maximum speeds of 300 ft/s, 600 ft/s, and 1000 ft/s (75%, 150%, and 250% of the target’s speed) respectively. We used the stable-baselines3 [
        <xref ref-type="bibr" rid="ref55">55</xref>
        ] implementation of TD3 to train the attackers. Each agent was composed of an actor and two critics, and we used a two-layer feedforward neural network of 400 and 300 hidden nodes for both the actor and the critics. TD3 is an off-policy algorithm: during training, transitions (s_t, a_t, s_{t+1}, r_{t+1}) are stored in a replay buffer [
        <xref ref-type="bibr" rid="ref52">52</xref>
        ] and drawn randomly in the form of mini-batches during the weight update phase. We used a mini-batch size of 512 (the replay buffer size is given with the other hyper-parameters in Table 2). As suggested in [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], we added an action noise drawn from an Ornstein-Uhlenbeck process to promote exploration. We trained all agents for 6 million steps. Table 2 provides a complete description of the hyper-parameters used during training. We evaluated the trained agents by Monte Carlo sampling on 10^5 simulations, varying all the initial positions in each episode. Results reported in Table 3 show that the interception of the target is possible, but highly dependent on the attacker’s speed relative to the target’s. The A1000 agent obtains a 94% success rate on its attacks. Conversely, the A300 agent only very rarely succeeds in performing an interception. The A600 agent manages to intercept the target in only 34% of the scenarios, although it flies slightly faster than the target.
      </p>
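      <p>A training setup along these lines could look as follows with stable-baselines3; `DroneInterceptionEnv` is a hypothetical wrapper around our simulator, and the noise scale is an assumption (the exact values are listed in Table 2).</p>
      <preformat>
# Training setup sketch with stable-baselines3's TD3 (hyper-parameters
# as reported above; DroneInterceptionEnv is a hypothetical wrapper
# around the simulator, and the noise scale is an assumption).
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise

env = DroneInterceptionEnv(attacker_max_speed=1000.0)  # hypothetical env
n_actions = env.action_space.shape[0]
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = TD3(
    "MlpPolicy", env,
    action_noise=action_noise,
    batch_size=512,
    policy_kwargs=dict(net_arch=[400, 300]),  # actor and critics: 400-300
)
model.learn(total_timesteps=6_000_000)
      </preformat>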
      <p>Impact of the speed. The previous experiment showed the importance of speed for the success rate of the attack. In some situations, too low a velocity does not allow the attacker to position itself in such a way that the ACAS-Xu detects a possible collision, and thus to have an opportunity to interfere with its target. We conducted a similar Monte Carlo experiment to evaluate the attacker’s area of influence according to its speed, fixing the delivery area’s position and its distance to the target. We compared the estimated influence area with a simplified theoretical one consisting of a disk D centered on the delivery area, with radius r = d1 · (V_A^max / V_T), i.e., the distance d1 between the target and the delivery area weighted by the ratio between the maximum speed of the attacker and the speed of the target. Figure 5 shows the estimated areas for the three agents. The agents A600 and A1000 almost always succeed in hijacking the target within the theoretical influence area; their estimated area is even slightly larger than the theoretical one. Conversely, agent A300 almost never succeeds in intercepting the target, even in theoretically favorable situations. This reveals that the ratio between the speeds of the attacker and its target does not only impact the attacker’s surface of influence: when the attacker’s maximum speed is lower than the target’s, the agent cannot influence the trajectory sufficiently. We illustrate the need for the attacker to be able to fly faster than its target by looking at an example of a successful hijacking by the agent A600. Figure 6 shows the trajectory of the attacker and its speed variations during a successful attack. In this example, the agent first heads towards the target at maximum speed to penetrate its immediate vicinity. It then approaches its target’s speed so as to divert it with more precision. We can then observe, through some peaks, that the agent sometimes needs to suddenly increase its speed above 400 ft/s to keep control over the target. Figure 7 shows examples of attempted hijackings when the attackers are in the close vicinity of the target. The fastest agent, A1000, smoothly diverts the target towards the interception area, while the A600 agent succeeds in its attack in a less straightforward fashion. The A300 agent fails to divert the target: it is easily outpaced by the target, which avoids it before resuming its trajectory towards the delivery zone.</p>
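      <p>For a quick numerical reading of this theoretical disk, the sketch below evaluates the radius for the three attacker speeds, with an arbitrary example distance d1 = 5000 ft.</p>
      <preformat>
# Theoretical influence disk: radius = d1 * (V_A_max / V_T).
def influence_radius(d1, v_attacker_max, v_target=400.0):
    return d1 * (v_attacker_max / v_target)

# Example with an arbitrary target-to-delivery distance of 5000 ft:
for v_max in (300.0, 600.0, 1000.0):
    print(v_max, influence_radius(5000.0, v_max))  # 3750.0, 7500.0, 12500.0
      </preformat>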
      <p>Impact of the geometry. With a speed much higher than the target’s, the agent A1000 almost systematically succeeds in the interception. We can distinguish two cases where the hijacking does not succeed despite the speed difference. Figure 8 shows examples of these situations. The first group consists of situations where the target is already too close to the delivery area and where the agent cannot interact quickly enough with the target. The second group involves situations where the delivery area is located between the target and the interception zone. It may happen, in these situations, that the policy learned by the attacker does not succeed in deviating the target’s trajectory enough to prevent it from reaching the delivery zone. We believe that fine-tuning the reward function would improve the agent’s performance in these situations. Since our goal is to demonstrate the feasibility of the attack and not to obtain a better score, we consider the search for the best policy outside the scope of this work.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        This work highlights a security flaw in automatic collision avoidance systems designed to equip future unmanned aerial vehicles. Through the example of a delivery drone equipped with the ACAS-Xu avoidance system, we demonstrate the possibility of diverting it from its trajectory to an interception zone using an agent trained with Reinforcement Learning. Although we made simplifying assumptions in modeling the problem (a single avoidance system on a horizontal plane), we believe that our method would generalize to other configurations with additional degrees of freedom. Our contribution is not limited to a new hacking method. We believe that it demonstrates the effectiveness of Reinforcement Learning in finding security holes in these autonomous systems, and thus opens the door to future certification processes based on RL adaptive stress testing. Since these subjects are essential for the acceptance and development of future autonomous vehicles, we make our simulator available to the community. In future work, we consider training both agents with Reinforcement Learning in a zero-sum two-player game (as in [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ]) to produce collision avoidance policies robust to malicious attacks.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Code of Federal Regulations CFR 91.221, General operating and flight rules, Subpart C - Equipment, Instrument, and Certificate Requirements, 2016.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Code of Federal Regulations CFR 135.180, Operating requirements: commuter and on demand operations and rules governing persons on board such aircraft; Subpart C - Aircraft and equipment; 135.180 Traffic Alert and Collision Avoidance System, 1994.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] TSO-C119B; Traffic Alert and Collision Avoidance System (TCAS) Airborne Equipment, TCAS II, 1998.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] RTCA - DO-185A Minimum Operational Performance Standards for Traffic Alert and Collision Avoidance System II (TCAS II) Airborne Equipment, Volume I, 2013.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Teso</surname>
          </string-name>
          ,
          <article-title>Aircraft hacking, 4th annual Hack in the Box (HITB</article-title>
          ) Security Conference in Amsterdam (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sathaye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schepers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ranganathan</surname>
          </string-name>
          , G. Noubir,
          <article-title>Wireless attacks on aircraft instrument landing systems</article-title>
          ,
          <source>in: 28th {USENIX} Security Symposium ({USENIX} Security 19)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>357</fpage>
          -
          <lpage>372</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] R. N. Akram, K. Markantonakis, R. Holloway, S. Kariyawasam, S. Ayub, A. Seeam, R. Atkinson, Challenges of security and trust in avionics wireless networks, in: 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC), 2015, pp. 4B1-1-4B1-12. doi:10.1109/DASC.2015.7311416.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Santamarta, Arm IDA and cross check: Reversing the 787's core network, https://act-on.ioactive.com/acton/attachment/34793/f-cd239504-44e6-42ab-85ce-91087de817d9/1/-/-/-/-/Arm-IDA%20and%20Cross%20Check%3A%20Reversing%20the%20787%27s%20Core%20Network.pdf, 2019.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. J. Kochenderfer, J. E. Holland, J. P. Chryssanthacopoulos, Next-generation airborne collision avoidance system, Technical Report, Massachusetts Institute of Technology Lincoln Laboratory, Lexington, United States, 2012.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] EUROCAE WG 75.1 / RTCA SC-147, Minimum Operational Performance Standards For Airborne Collision Avoidance System Xu (ACAS Xu), 2020.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] F. Netjasov, A. Vidosavljevic, V. Tosic, M. H. Everdij, H. A. Blom, Development, validation and application of stochastically and dynamically coloured petri net model of ACAS operations for safety assessment purposes, Transportation Research Part C: Emerging Technologies 33 (2013) 167-195.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J.-B. Jeannin, K. Ghorbal, Y. Kouskoulas, R. Gardner, A. Schmidt, E. Zawadzki, A. Platzer, Formal verification of ACAS X, an industrial airborne collision avoidance system, in: 2015 International Conference on Embedded Software (EMSOFT), 2015, pp. 127-136. doi:10.1109/EMSOFT.2015.7318268.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Lahsen-Cherif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lamy-Bergot</surname>
          </string-name>
          ,
          <article-title>Real-time drone anti-collision avoidance systems: an edge artificial intelligence application</article-title>
          ,
          <source>in: 2022 IEEE Radar Conference (RadarConf22)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. D. Julian, J. Lopez, J. S. Brush, M. P. Owen, M. J. Kochenderfer, Deep neural network compression for aircraft collision avoidance systems, 35th Digital Avionics Systems Conference (DASC) (2016).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <article-title>Approximate bisimulation relations for neural networks and application to assured neural network compression</article-title>
          ,
          <source>arXiv preprint arXiv:2202.01214</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An efficient SMT solver for verifying deep neural networks, CoRR abs/1702.01135 (2017).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Clavière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Asselin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Garion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pagetti</surname>
          </string-name>
          ,
          <article-title>Safety verification of neural network controlled systems</article-title>
          ,
          <source>in: 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Damour, F. D. Grancey, C. Gabreau, A. Gaufriau, J.-B. Ginestet, A. Hervieu, T. Huraux, C. Pagetti, L. Ponsolle, A. Clavière, Towards certification of a reduced footprint ACAS-Xu system: A hybrid ML-based solution, in: International Conference on Computer Safety, Reliability, and Security, Springer, 2021, pp. 34-48.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          , et al.,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          , nature
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , I. Babuschkin,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dudzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ewalds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          , et al.,
          <article-title>Grandmaster level in starcraft ii using multi-agent reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>575</volume>
          (
          <year>2019</year>
          )
          <fpage>350</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Maddison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Van Den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Panneershelvam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          , et al.,
          <article-title>Mastering the game of go with deep neural networks and tree search</article-title>
          ,
          <source>nature</source>
          <volume>529</volume>
          (
          <year>2016</year>
          )
          <fpage>484</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          , I. Antonoglou,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graepel</surname>
          </string-name>
          , et al.,
          <article-title>A general reinforcement learning algorithm that masters chess, shogi, and go through self-play</article-title>
          ,
          <source>Science</source>
          <volume>362</volume>
          (
          <year>2018</year>
          )
          <fpage>1140</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          , I. Antonoglou,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hubert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lockhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hassabis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Graepel</surname>
          </string-name>
          , et al.,
          <article-title>Mastering Atari, Go, chess and shogi by planning with a learned model</article-title>
          ,
          <source>Nature</source>
          <volume>588</volume>
          (
          <year>2020</year>
          )
          <fpage>604</fpage>
          -
          <lpage>609</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Dorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <article-title>Modern control systems</article-title>
          , Pearson Prentice Hall,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Waslander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Tomlin</surname>
          </string-name>
          ,
          <article-title>Multi-agent quadrotor testbed control design: Integral sliding mode vs. reinforcement learning</article-title>
          ,
          <source>in: 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems</source>
          , IEEE,
          <year>2005</year>
          , pp.
          <fpage>3712</fpage>
          -
          <lpage>3717</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>K. D.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Utkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Ozguner</surname>
          </string-name>
          ,
          <article-title>A control engineer's guide to sliding mode control</article-title>
          ,
          <source>IEEE Transactions on Control Systems Technology</source>
          <volume>7</volume>
          (
          <year>1999</year>
          )
          <fpage>328</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hwangbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siegwart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Control of a quadrotor with reinforcement learning</article-title>
          ,
          <source>IEEE Robotics and Automation Letters</source>
          <volume>2</volume>
          (
          <year>2017</year>
          )
          <fpage>2096</fpage>
          -
          <lpage>2103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>W.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mancuso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>West</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bestavros</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for UAV attitude control</article-title>
          ,
          <source>ACM Transactions on Cyber-Physical Systems</source>
          <volume>3</volume>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Hunt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tassa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <article-title>Continuous control with deep reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1509.02971</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <article-title>Trust region policy optimization</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1889</fpage>
          -
          <lpage>1897</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>arXiv preprint arXiv:1707.06347</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>P.</given-names>
            <surname>Becker-Ehmck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>van der Smagt</surname>
          </string-name>
          ,
          <article-title>Learning to fly via deep model-based reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:2003.08876</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>H. X.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>La</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Feil-Seifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Autonomous UAV navigation using reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1801.05086</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Watkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dayan</surname>
          </string-name>
          ,
          <article-title>Q-learning</article-title>
          ,
          <source>Machine Learning</source>
          <volume>8</volume>
          (
          <year>1992</year>
          )
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Biyue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wenbo</surname>
          </string-name>
          ,
          <article-title>UAV navigation in high dynamic environments: A deep reinforcement learning approach</article-title>
          ,
          <source>Chinese Journal of Aeronautics</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>479</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>B.</given-names>
            <surname>Vlahov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Squires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Strickland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pippin</surname>
          </string-name>
          ,
          <article-title>On developing a UAV pursuit-evasion policy using reinforcement learning</article-title>
          ,
          <source>in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>859</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1928</fpage>
          -
          <lpage>1937</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Shaw</surname>
          </string-name>
          ,
          <article-title>Fighter Combat: Tactics and Maneuvering</article-title>
          , Naval Institute Press: Annapolis, MD, USA (
          <year>1985</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <article-title>UAV swarm attack-defense confrontation based on multi-agent reinforcement learning</article-title>
          ,
          <source>in: Advances in Guidance, Navigation and Control</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>5599</fpage>
          -
          <lpage>5608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mordatch</surname>
          </string-name>
          ,
          <article-title>Multi-agent actor-critic for mixed cooperative-competitive environments</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          )
          <fpage>6379</fpage>
          -
          <lpage>6390</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>M.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>Learning-based defense against malicious unmanned aerial vehicles</article-title>
          ,
          <source>in: 2018 IEEE 87th Vehicular Technology Conference (VTC Spring)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gnanasekera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Savkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Katupitiya</surname>
          </string-name>
          ,
          <article-title>Range measurements based UAV navigation for intercepting ground targets</article-title>
          ,
          <source>in: 2020 6th International Conference on Control, Automation and Robotics (ICCAR)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>468</fpage>
          -
          <lpage>472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>E.</given-names>
            <surname>Çetin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pastor</surname>
          </string-name>
          ,
          <article-title>Counter a drone in a complex neighborhood area by deep reinforcement learning</article-title>
          ,
          <source>Sensors</source>
          <volume>20</volume>
          (
          <year>2020</year>
          )
          <fpage>2320</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Autonomous decision-making generation of UAV based on soft actor-critic algorithm</article-title>
          ,
          <source>in: 2020 39th Chinese Control Conference (CCC)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>7350</fpage>
          -
          <lpage>7355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>T.</given-names>
            <surname>Haarnoja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1861</fpage>
          -
          <lpage>1870</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. J.</given-names>
            <surname>Mengshoel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Brat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Owen</surname>
          </string-name>
          ,
          <article-title>Adaptive stress testing of airborne collision avoidance systems</article-title>
          ,
          <source>in: 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC)</source>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>6C2-1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>R.</given-names>
            <surname>Coulom</surname>
          </string-name>
          ,
          <article-title>Efficient selectivity and backup operators in Monte-Carlo tree search</article-title>
          ,
          <source>in: International conference on computers and games</source>
          , Springer,
          <year>2006</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>G.</given-names>
            <surname>Manfredi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jestin</surname>
          </string-name>
          ,
          <article-title>An introduction to ACAS Xu and the challenges ahead</article-title>
          ,
          <source>in: 35th Digital Avionics Systems Conference (DASC'16)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>McAllester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mansour</surname>
          </string-name>
          , et al.,
          <article-title>Policy gradient methods for reinforcement learning with function approximation</article-title>
          ,
          <source>in: NIPS</source>
          , volume
          <volume>99</volume>
          , Citeseer
          ,
          <year>1999</year>
          , pp.
          <fpage>1057</fpage>
          -
          <lpage>1063</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Degris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Deterministic policy gradient algorithms</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>387</fpage>
          -
          <lpage>395</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Playing Atari with deep reinforcement learning</article-title>
          ,
          <source>arXiv preprint arXiv:1312.5602</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>H.</given-names>
            <surname>Van Hasselt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning with double q-learning</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>30</volume>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>van Hoof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meger</surname>
          </string-name>
          ,
          <article-title>Addressing function approximation error in actor-critic methods</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1587</fpage>
          -
          <lpage>1596</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>A.</given-names>
            <surname>Raffin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ernestus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gleave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kanervisto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dormann</surname>
          </string-name>
          , Stable-Baselines3, https://github.com/DLR-RM/stable-baselines3,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>