Autonomous drone interception with Deep Reinforcement Learning

David Bertoin†1,2,3, Adrien Gauffriau†1,4, Damien Grasset†5 and Jayant Sen Gupta1,6
1 IRT Saint-Exupery, Toulouse, France
2 ISAE-SUPAERO, Toulouse, France
3 Institut de Mathematiques de Toulouse, France
4 Airbus Operations, Toulouse, France
5 IRT Saint-Exupery Montreal, Canada
6 Airbus AI Research, Toulouse, France

Abstract
Driven by recent successes in artificial intelligence, new autonomous navigation systems are emerging in the urban space. The adoption of such systems raises questions about certification criteria and their vulnerability to external threats. This work focuses on the automated anti-collision systems designed for autonomous drones operating in an urban context, which is less controlled than conventional airspace and more vulnerable to potential intruders. In particular, we highlight the vulnerabilities of such systems to hijacking, taking as an example the scenario of an autonomous delivery drone diverted from its mission by a malicious agent. We demonstrate the possibility of training Reinforcement Learning agents to deflect a drone equipped with an automated anti-collision system. Our contribution is threefold. Firstly, we illustrate the security vulnerabilities of these systems. Secondly, we demonstrate the effectiveness of Reinforcement Learning for the automatic detection of security flaws. Thirdly, we provide the community with an original benchmark based on an industrial use case.

Keywords
Automated anti-collision system, Autonomous Drone, Reinforcement Learning, Urban Mobility, UAV, Security

ATT'22: Workshop Agents in Traffic and Transportation, July 25, 2022, Vienna, Austria
Contact: david.bertoin@irt-saintexupery.com (D. Bertoin†); adrien.gauffriau@airbus.com (A. Gauffriau†); damien.grasset@irt-saintexupery.com (D. Grasset†); jayant.sengupta@airbus.com (J. S. Gupta)

1. Introduction
Spurred on by recent advances in Machine Learning, autonomous vehicles are destined to become an integral part of tomorrow's mobility. So far limited to land conveyance, a new category of autonomous vehicles (e.g., air cabs, delivery drones) is about to take its place in the urban mobility space. In a similar way to self-driving cars, unmanned aerial vehicles (UAVs) raise new public acceptance and certification challenges. Guaranteeing the non-collision of these UAVs with the obstacles present in the urban landscape and with other flying objects is part of these challenges. While flight plans and airways are used to prevent collisions in traditional airspace, they most often require human supervision. The potential number of UAVs circulating simultaneously in urban air mobility contexts makes this approach unfeasible. Thus, many UAVs are currently equipped with automatic collision avoidance systems. These systems are subject to stringent controls ([1], [2], [3], [4]) to obtain the certifications allowing them to be integrated into UAVs. Although these certifications guarantee these systems' proper functioning and efficiency, the deterministic nature of their decisions makes their behavior predictable. We argue that this predictability exposes ownships equipped with such systems to malicious attacks.
In this paper, we take advantage of Reinforcement Learning (RL) agents' exploration and exploitation capabilities to demonstrate the possibility of diverting an autonomous drone to an interception zone by hacking its obstacle avoidance system. We implement a simulator based on the scenario of an autonomous delivery drone trying to reach a delivery area and being hijacked by another drone. We use RL to train an autonomous attacker to position itself to deceive the anti-collision system and thereby modify its target's trajectory towards the desired zone. Our work thus highlights a critical security flaw of UAVs and opens the way to the use of RL in the certification procedures of such systems. This paper is organized as follows. Section 2 presents the related works. Section 3 introduces background knowledge on the ACAS-X anti-collision system family and Reinforcement Learning. Section 4 presents the general interception scenario and the simulator design. Section 5 shows empirical results and discusses the critical components of successful hijackings. Finally, Section 6 summarizes and concludes this paper with future perspectives.

2. Related works
Security in avionics. Ongoing research into security vulnerabilities plays a fundamental role in the fight against hijacking. In [5], the authors exploit vulnerabilities of the Aircraft Communication Addressing and Reporting System (ACARS) network to upload new flight plans into the aircraft's Flight Management System (FMS). Nevertheless, since pilots still control the aircraft, the attack mainly leads to an increase in their workload. [6] directly targets the ground infrastructure and designs two different attacks, using radio signals, on the Instrument Landing System (ILS) that prevent the aircraft from landing successfully. The authors demonstrate a consistent success rate, with offset touchdowns from 18 to over 50 meters laterally and longitudinally. [7] highlights the vulnerabilities of Avionics Wireless Networks (AWN). [8] describes a detailed, plausible attack that effectively reaches the avionics network of a commercial airplane. Due to the complexity and cost of deploying such attacks, most of these results remain theoretical. To date, none of these attacks enables complete control of the aircraft. Their impact is often mitigated by the presence of humans in the loop. In the future, the introduction of low-complexity autonomous systems, combined with the popularization of drones and their accessibility, may change this paradigm. Our work focuses on this new generation of UAVs, which is about to occupy the urban airspace, and highlights their vulnerability to malicious attacks.

Automatic anti-collision systems. Although supplemented by pilots' interventions, automated collision detection and avoidance plays a crucial role in the safety of flying vehicles. These systems are mainly used in airliners to provide maneuvering recommendations to the pilot and thus lighten their general workload. The most commonly used is the Traffic Collision Avoidance System (TCAS). Unlike newer systems, the TCAS requires both aircraft to be equipped in order to be effective. Based on [9], the Airborne Collision Avoidance System X (ACAS X) [10] has been designed for autonomous vehicles.
The ACAS X systems rely upon probabilistic models to represent the various sources of uncertainty and upon computer-based optimization to provide the best possible collision avoidance recommendations. Their robustness has been demonstrated notably in [11] using a Petri net model and in [12] using formal methods. Within the ACAS X family, the ACAS-Xu is specifically designed for unmanned aircraft systems (UAS). Due to the reduced capabilities of UAS, this variant solves conflicts using horizontal maneuvers. However, the size of the cost tables (∼5 GB) makes them challenging to embed within small UAVs. [13], [14], and [15] study the possibility of replacing these look-up tables with surrogate models to reduce the system's footprint. Nonetheless, the use of surrogates raises the problem of their certification. [16], [17], and [18] employ formal methods to provide guarantees on these models. Our contribution highlights security flaws of the ACAS-Xu but could be extended to other ACAS X versions as well.

RL for Unmanned Aerial Vehicles. Recent successes in Reinforcement Learning have demonstrated the ability of autonomous agents to outperform humans in many tasks [19, 20, 21, 22, 23], and several works have applied it to the control and navigation of UAVs. RL provides a promising alternative to classical stability and control methods such as Proportional-Integral-Derivative (PID) control systems [24]. While PID control has demonstrated excellent results in stable environments, it is less effective in unpredictable and harsh environments. Recently, several research projects have explored the possibility of using Reinforcement Learning to address its limitations. [25] compares the efficiency of a model-based Reinforcement Learning controller with Integral Sliding Mode (ISM) control [26]. The authors of [27] train a neural network policy for quadrotor controllers using an original policy optimization algorithm with Monte-Carlo estimates. The learned policy manages to stabilize the quadrotor in the air even under very harsh initializations, both in simulation and on a real quadrotor. [28] trains autonomous flight control systems with state-of-the-art model-free deep Reinforcement Learning algorithms (Deep Deterministic Policy Gradient [29], Trust Region Policy Optimization [30], Proximal Policy Optimization [31]) and compares their performance with PID controllers. In [32], a sequential latent variable model is learned from flying sequences of an actual drone controlled with PID. This latent dynamic model is used as a generative model to learn a deep model-based RL agent directly on real drones with a limited number of steps. Mission planning is another classical use case of RL for UAVs. In [33], the authors combine a Q-learning [34] algorithm focusing on the navigation policy with PID controllers. In [35], the navigation problem is decomposed into two simpler sub-tasks (collision avoidance and approaching the target), each of them solved by a separate neural network in a distributed deep RL framework. An active field of research focuses on interception and defense against malicious drones. In a 1-vs-1 close combat situation, [36] demonstrates the effectiveness of an A3C [37] RL agent against an opponent with a Greedy Shooter policy [38]. In a multi-agent context, [39] uses a Multi-Agent Deep Deterministic Policy Gradient algorithm (MADDPG) [40] in an attack-defense confrontation Markov game.
[41] proposes a ground defense system trained with Q-learning to choose between high-level defense strategies (GPS spoofing, jamming, hacking, and laser shooting). While [42] and [43] use RL to train a drone attacker to intercept a target drone, [44] places the agent in the defender's position and trains it with a Soft Actor-Critic algorithm [45] to avoid capture. [46] proposes adaptive stress testing based on Monte Carlo Tree Search [47] to find the most likely state trajectories leading to near mid-air collisions. Our contribution also aims at intercepting a target UAV using an RL agent. However, it differs on two significant points. First, it highlights the security flaws of a deterministic policy dictated by the ACAS-Xu system for collision avoidance. Second, our attacker does not seek to capture the target directly but to guide it to a specific area where it can potentially be captured. This strategy does not require any attack equipment directly mounted on the attacking UAV and can easily be applied to any UAV.

3. Background

3.1. Overview of the ACAS system
The Airborne Collision Avoidance System X (ACAS X) is a collection of rules providing an optimized decision logic for airborne collision avoidance. Within the ACAS X family, the ACAS-Xu is specifically designed for drones and urban air mobility. By querying a set of lookup tables (LUT) computed offline, the ACAS-Xu provides, in real time, a horizontal resolution of conflicts. Using data collected from its surrounding environment, the ownship queries the LUT for the collision probabilities of five different directional recommendations: COC (Clear of Conflict for the current heading), WL (Weak Left), WR (Weak Right), L (Left), R (Right). The ACAS-Xu is therefore used to avoid any static object (tower, crane, antenna) or any moving object (e.g., birds, UAVs) whose behavior could be unpredictable or which could even use the same avoidance system. No particular maneuver is required when the ownship is in the COC state. Otherwise, the autonomous agent may decide to follow the advisory that minimizes the probability of conflict according to the geometric parameters. We describe these parameters, along with their units of measure, in Table 1 and illustrate in Figure 1 the example of an intruder crossing the ownship's trajectory.

Table 1: Geometric parameters
Param. (units)   Description
ρ (ft)           Distance from ownship to intruder
θ (rad)          Angle between ownship and intruder relative to the ownship's heading
ψ (rad)          Heading angle of intruder relative to the ownship's heading
v_own (ft/s)     Speed of the ownship
v_int (ft/s)     Speed of the intruder

Figure 1: ACAS-Xu geometry [16]

A complete description of the ACAS-Xu system can be found in [48].

3.2. Reinforcement Learning
Within the field of Machine Learning, Reinforcement Learning is particularly suited for sequential decision-making problems. In RL, an agent interacts with its environment during a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent is provided a representation of the environment state s_t ∈ S, where S is the space of all possible states. Considering s_t, the agent takes an action a_t ∈ A, where A is the space of all possible actions (including inaction). Performing this action in the environment causes the environment to transition from s_t to s_{t+1}, and as a consequence of this transition, the agent receives a numerical reward r_t ∈ ℝ.
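This interaction loop can be sketched in a few lines of Python (a generic, illustrative example with a toy environment; the names below are ours and do not correspond to the simulator introduced later):

import random

class ToyEnv:
    """Toy environment: the state is a scalar that the agent tries to drive to zero."""
    def reset(self):
        self.state = random.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        self.state += action                  # transition from s_t to s_{t+1}
        reward = -abs(self.state)             # reward r_t associated with the transition
        done = abs(self.state) < 1e-2         # episode termination condition
        return self.state, reward, done

def policy(state):
    """Stand-in for a policy pi(a|s): here, a uniformly random action."""
    return random.uniform(-0.1, 0.1)

env = ToyEnv()
state = env.reset()
episode_return, discount, gamma = 0.0, 1.0, 0.99
for t in range(1000):
    action = policy(state)                    # a_t sampled from pi(.|s_t)
    state, reward, done = env.step(action)    # observe s_{t+1} and r_t
    episode_return += discount * reward       # accumulate the discounted return
    discount *= gamma
    if done:
        break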
Figure 2: Agent-environment interaction loop

Figure 2, taken from [49], illustrates the agent-environment interactions. The mapping of a state s to a probability of taking each possible action in A is called the agent's policy and denoted π(a|s) = P[A_t = a | S_t = s]. Considering a discount factor γ ∈ [0, 1], the return is defined as the discounted sum of rewards R_t = Σ_{k=0}^{K} γ^k r_{t+k}. Deep Reinforcement Learning's goal consists in finding the policy π_φ, represented by a neural network with parameters φ, that maximizes the expected return J(π_φ) = E_{τ∼π_φ}[R(τ)], with τ = (s_0, a_0, ..., s_{K+1}) the trajectory obtained by following the policy π_φ starting from state s_0. For continuous control problems (such as motor speed control), policy gradient methods aim at learning a parameterized policy π_φ through gradient ascent on J(π_φ). These methods rely on the policy gradient theorem [50]:

∇_φ J(φ) = E_{s∼ρ^π, a∼π_φ}[∇_φ log π_φ(a|s) Q^π(s, a)],

where Q^π(s, a) = E[R_t | s_t = s, a_t = a] is the action-value function. Q^π(s, a) represents the expected return of performing action a in state s and following π afterwards. Policy gradient methods typically require an estimate of Q^π(s, a). An approach used in actor-critic methods consists in using a parameterized estimator, called the critic, to estimate Q^π(s, a) (π_φ thus represents the actor part of the agent). Relying on this principle, [51] propose the deterministic policy gradient algorithm to compute ∇_φ J(φ):

∇_φ J(φ) = E_{s∼ρ^π}[∇_a Q^π(s, a)|_{a=π(s)} ∇_φ π_φ(s)].

The Deep Deterministic Policy Gradient (DDPG) algorithm [29] adapts the ideas underlying the success of Deep Q-Learning [52] [19] to estimate Q^π with a neural network with parameters θ. In DDPG, the learned Q-function tends to overestimate Q^π(s, a), thus leading the policy to exploit the Q-function estimation errors. Inspired by Double Q-learning [53], the Twin Delayed DDPG (TD3) [54] addresses this overestimation by taking the minimum estimate between a pair of critics and adding noise to the actions used to form the Q-learning target. Combined with a less frequent policy update (one update every d critic updates), these tricks result in substantially improved performance over DDPG in a number of challenging tasks in the continuous control setting. Algorithm 1 describes TD3's training procedure.

Algorithm 1: TD3
Initialize critic networks Q_θ1, Q_θ2, and actor network π_φ with random parameters θ1, θ2, φ
Initialize target networks θ1' ← θ1, θ2' ← θ2, φ' ← φ
Initialize replay buffer B
for t = 1 to T do
    Select action with exploration noise a ∼ π_φ(s) + ε, ε ∼ N(0, σ), and observe reward r and new state s'
    Store transition tuple (s, a, r, s') in B
    Sample mini-batch of N transitions (s, a, r, s') from B
    ã ← π_φ'(s') + ε, ε ∼ clip(N(0, σ̃), −c, c)
    y ← r + γ min_{i=1,2} Q_θi'(s', ã)
    Update critics: θi ← argmin_{θi} (1/N) Σ (y − Q_θi(s, a))²
    if t mod d then
        Update φ by the deterministic policy gradient:
            ∇_φ J(φ) = (1/N) Σ ∇_a Q_θ1(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s)
        Update target networks:
            θi' ← τ θi + (1 − τ) θi'
            φ' ← τ φ + (1 − τ) φ'
    end
end
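To make the critic update in Algorithm 1 concrete, the following sketch computes the clipped double-Q target y for a mini-batch (an illustrative NumPy example with placeholder target networks, not the implementation used in this work; the terminal-state mask is a standard detail not shown in Algorithm 1):

import numpy as np

rng = np.random.default_rng(0)

# Placeholder target networks; in practice these are neural networks.
def target_actor(s):                       # pi_phi'(s')
    return np.tanh(s.sum(axis=1, keepdims=True))

def target_critic_1(s, a):                 # Q_theta1'(s', a)
    return s.sum(axis=1, keepdims=True) + a

def target_critic_2(s, a):                 # Q_theta2'(s', a)
    return s.mean(axis=1, keepdims=True) + 0.9 * a

def td3_target(r, s_next, done, gamma=0.99, sigma_tilde=0.2, c=0.5):
    """y = r + gamma * min_i Q_theta_i'(s', a~), with smoothed target action a~."""
    noise = np.clip(rng.normal(0.0, sigma_tilde, size=(len(r), 1)), -c, c)
    a_tilde = np.clip(target_actor(s_next) + noise, -1.0, 1.0)      # target policy smoothing
    q_min = np.minimum(target_critic_1(s_next, a_tilde),
                       target_critic_2(s_next, a_tilde))            # clipped double-Q estimate
    return r + gamma * (1.0 - done) * q_min                         # no bootstrapping at terminal states

# Example mini-batch of N = 4 transitions with 3-dimensional states.
r = rng.normal(size=(4, 1))
s_next = rng.normal(size=(4, 3))
done = np.zeros((4, 1))
y = td3_target(r, s_next, done)   # regression target for both critics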
4. Drone interception scenario
We consider an environment composed of two agents (T, A) and two areas of interest (D_1, D_2) inside a delimited playground. The first agent T, which we refer to as the target, is a delivery drone whose mission is to reach the delivery area D_1. This target drone is equipped with the ACAS-Xu system introduced in Section 3.1. The target starts in position I_t, has a fixed velocity v_t, and uses the autonomous avoidance system to modify its trajectory when a possible conflict is detected. The second agent A, referred to as the attacker, aims at hijacking the target towards an alternate delivery area, called the interception area D_2. To do so, the attacker exploits the potential flaws induced by the target's use of the ACAS-Xu avoidance system to deflect its trajectory toward the interception area D_2. The attacker starts in position I_a and can continuously adjust its velocity v_a within the range [0, v_max]. Both agents interact in a 2-dimensional playground. Gravity is not considered, and the scene does not contain any obstacles. Figure 3 provides a schematic representation of the set-up.

Figure 3: The interception scenario: the target heads toward its delivery zone while the attacker seeks to divert it to the interception zone by taking advantage of the recommendations given by the ACAS-Xu.

One can note that the Xu version of the ACAS system is dedicated to drones and only provides horizontal avoidance recommendations. Among the other versions of the ACAS, the Xa version is dedicated to large aircraft and provides vertical avoidance. For the sake of simplicity, we restrict the interception to a horizontal plane and to the use of the ACAS-Xu avoidance system only. Adding a vertical dimension would not necessarily make the learning task more complex (given some adjustments to the reward model). We leave the study of a 3D interception with a target equipped with both the Xu and the Xa versions to future work.

The simulator environment. We implemented this scenario in an original open-source simulator¹. In the following, we describe its main characteristics.
¹ Since the ACAS-Xu lookup tables are exclusively available to organizations that are part of RTCA/EUROCAE (the authority responsible for the standardization of ACAS-Xu), a mode allowing the simulator to run with a surrogate model emulating the tables is available. Link available by the time of the conference.

State space: The state of the environment at step n is fully described by the states of the two agents (target and attacker) as well as the Cartesian positions of the delivery and interception areas. Each agent's state is composed of its Cartesian position P_n^agent = (x, y) and its velocity vector V⃗_n^agent in the horizontal plane. We denote by α (resp. V) the heading angle (resp. the norm) of the velocity vector, and use V⃗ = (α, V) as its representation. The last recommendation AX_{n−1} of the ACAS-Xu system is also included in the state.

Action space: The attacker's action vector at step n is composed of two updates, δV_n ∈ [−200, +200] (ft/s) and δα_n ∈ [−π/2, π/2] (rad), representing respectively an update of its velocity and of its heading.

Transitions: The target agent's velocity is constant during the whole episode. Its heading is updated according to the ACAS-Xu recommendations. If the advisory provided is different from COC, the following heading update δα_n is performed: WL → +0.15 (rad), WR → −0.15 (rad), L → +0.3 (rad) and R → −0.3 (rad). If the advisory is COC, the δα update allows the target to reach the optimal heading pointing to the delivery area, with a maximum variation of 0.3 (rad). This variation is constrained to represent actual drone maneuverability and to avoid instability due to big turns. Both agents' positions P are updated by P_{n+1} = P_n + V⃗_{n+1}, with the update of the speed vector given by:

‖V_{n+1}‖ = ‖V_n‖ + δV_n
α_{n+1} = α_n + δα_n
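The transition rules above can be summarized as follows (an illustrative Python sketch of the update logic only; the function names and the angle-wrapping convention are ours, not the simulator's actual implementation):

import math

# Heading increments (rad) associated with each ACAS-Xu advisory, as described above.
ADVISORY_TO_DALPHA = {"WL": +0.15, "WR": -0.15, "L": +0.3, "R": -0.3}
MAX_TURN = 0.3  # maximum heading variation per step when clear of conflict (rad)

def update_target(x, y, alpha, v, advisory, delivery_xy):
    """One step for the target: heading update from the advisory, then position update."""
    if advisory == "COC":
        # Steer toward the delivery area, limited to +/- MAX_TURN per step.
        desired = math.atan2(delivery_xy[1] - y, delivery_xy[0] - x)
        diff = (desired - alpha + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi]
        alpha += max(-MAX_TURN, min(MAX_TURN, diff))
    else:
        alpha += ADVISORY_TO_DALPHA[advisory]
    # Position update P_{n+1} = P_n + V_{n+1}; the target's speed is constant.
    return x + v * math.cos(alpha), y + v * math.sin(alpha), alpha

def update_attacker(x, y, alpha, v, d_alpha, d_v, v_max):
    """Attacker applies its action (d_v, d_alpha), with speed kept in [0, v_max]."""
    alpha += max(-math.pi / 2, min(math.pi / 2, d_alpha))
    v = max(0.0, min(v_max, v + d_v))
    return x + v * math.cos(alpha), y + v * math.sin(alpha), alpha, v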
Reward model: RL agents are strongly impacted by the reward model used during training. The environment implements the following reward schema:

r_n = r_{n−1} + (d_{n−1}^{D2} − d_n^{D2})   if d_n^{D1} ≥ d_min^{D1} and d_n^{D2} < d_min^{D2}
r_n = 0                                     otherwise

with d_n^{D1} the distance between the target and the delivery zone, d_n^{D2} the distance between the target and the interception zone, and d_min^{D1} and d_min^{D2} their respective minima since the episode's beginning. The agent is rewarded when d_min^{D2} is reduced. To avoid the degenerate case where the attacker makes the target move away and closer continuously, the value of the reward at each step is increased as long as the target gets closer to the interception zone. It is reset to 0 otherwise. This reward model choice encourages the agent to divert the target with the most direct trajectory possible. Figure 4 shows two visualizations of the simulator.

Figure 4: The simulator environment. (a) Attacker's point of view. (b) Top view.

5. Experimental results and discussion
This section exhibits the efficiency of RL agents in hijacking the target delivery drone and discusses the different factors impacting the success of an interception. For the sake of simplicity, we fixed the target's velocity to 400 ft/s. We trained three attacker agents, A_300, A_600 and A_1000, capable of reaching maximum speeds of 300 ft/s, 600 ft/s, and 1000 ft/s (75%, 150% and 250% of the target's speed) respectively. We used the stable-baselines3 [55] implementation of TD3 to train the attackers. Each agent was composed of an actor and two critics, and we used a two-layer feedforward neural network of 400 and 300 hidden nodes for both the actor and the critics. TD3 is an off-policy algorithm: during training, transitions (s_t, a_t, s_{t+1}, r_{t+1}) are stored in a replay buffer [52] and drawn randomly in the form of mini-batches during the weight update phase. We used a replay buffer of size 5·10^4 and a mini-batch size of 512. As suggested in [29], we added an action noise drawn from the Ornstein-Uhlenbeck process to promote exploration. We trained all agents for 6 million steps. Table 2 provides a complete description of the hyper-parameters used during training.

Table 2: Hyper-parameters used in TD3
Parameter                    Value
Training steps               6,000,000
Learning rate                0.001
γ                            0.99
Policy delay                 2
τ                            0.005
Target policy noise          0.2
Ornstein-Uhlenbeck noise     0.01
Replay buffer size           50,000
Mini-batch size              512
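As an illustration, the configuration in Table 2 maps onto the stable-baselines3 TD3 interface roughly as follows (a hedged sketch: a standard continuous-control task stands in for the interception simulator, which would be wrapped as a gym-compatible environment, and reading the Ornstein-Uhlenbeck noise value as the noise sigma is our assumption):

import gymnasium as gym
import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.noise import OrnsteinUhlenbeckActionNoise

# Stand-in continuous-control environment; the actual interception simulator
# would be used here instead once wrapped with the gym interface.
env = gym.make("Pendulum-v1")

n_actions = env.action_space.shape[0]
action_noise = OrnsteinUhlenbeckActionNoise(
    mean=np.zeros(n_actions), sigma=0.01 * np.ones(n_actions))

model = TD3(
    "MlpPolicy",
    env,
    learning_rate=1e-3,                       # Table 2: learning rate
    buffer_size=50_000,                       # Table 2: replay buffer size
    batch_size=512,                           # Table 2: mini-batch size
    gamma=0.99,                               # Table 2: discount factor
    tau=0.005,                                # Table 2: target update rate
    policy_delay=2,                           # Table 2: policy delay
    target_policy_noise=0.2,                  # Table 2: target policy noise
    action_noise=action_noise,                # Ornstein-Uhlenbeck exploration noise
    policy_kwargs=dict(net_arch=[400, 300]),  # two-layer actor/critic: 400 and 300 units
)
model.learn(total_timesteps=6_000_000)        # Table 2: training steps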
Table 3: Overall performance
Agent     Success rate
A_300     <1%
A_600     37%
A_1000    94%

Hijacking success rate. We assessed the performance of the three attackers by Monte-Carlo sampling on 10^5 simulations, varying all the initial positions in each episode. Results reported in Table 3 show that the interception of the target is possible, but highly dependent on the attacker's speed relative to the target's. The A_1000 agent obtains a 94% success rate on its attacks. Conversely, the A_300 agent only very rarely succeeds in performing an interception. The A_600 agent manages to intercept the target in only 34% of the scenarios, although it goes slightly faster than the target.

Figure 5: Empirical estimation of the attacker's influence area.

Figure 6: Example of successful hijacking. Left: trajectories of the attacker and the target. Middle: attacker's velocity during the episode. Right: distance between the attacker and the target during the episode.

Impact of the speed. The previous experiment showed the importance of speed on the success rate of the attack. In some situations, too low a velocity does not allow the attacker to position itself in such a way that the ACAS-Xu detects a possible collision, and thus to have an opportunity to interfere with its target. We conducted a similar Monte-Carlo experiment to evaluate the attacker's area of influence according to its speed, fixing the delivery area's position and its distance to the target. We compared the estimated influence area with a simplified theoretical one consisting of a disk D centered on the delivery area, with radius R = (v_a^max / v_t) · d^{D1}, i.e., the distance d^{D1} between the target and the delivery area weighted by the ratio between the maximum speed of the attacker and the speed of the target. Figure 5 shows the estimated areas for the three agents. The agents A_600 and A_1000 almost always succeed in hijacking the target within the theoretical influence area. Their estimated area is even slightly larger than the theoretical one. Conversely, agent A_300 almost never succeeds in intercepting the target, even in theoretically favorable situations. This reveals that the ratio between the speeds of the attacker and its target does not only impact the attacker's surface of influence: when the attacker's maximum speed is lower than the target's, the agent cannot influence the trajectory sufficiently. We demonstrate the need for the attacker to be able to fly faster than its target by looking at an example of a successful hijacking by the agent A_600. Figure 6 shows the trajectory of the attacker and its speed variations during a successful attack. In this example, the agent first heads towards the target at maximum speed to penetrate its immediate vicinity. It then matches its target's speed in order to divert it with more precision. We can then observe, through some peaks, that the agent sometimes needs to suddenly increase its speed above 400 ft/s to keep control over the target. Figure 7 shows examples of hijacking attempts when the attackers are in the close vicinity of the target. The fastest agent A_1000 smoothly diverts the target towards the interception area, while the A_600 agent succeeds in its attack in a less straightforward fashion. The A_300 agent fails to divert the target: it is easily outpaced by the target, which avoids it before resuming its trajectory towards the delivery zone.

Figure 7: Examples of hijacking attempt trajectories when the attacker is in the close vicinity of the target. Left: A_1000. Middle: A_600. Right: A_300.
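For reference, the theoretical influence disk defined above can be computed in a few lines (an illustrative sketch of our reading of the formula; the positions and speeds in the example are arbitrary):

import math

def theoretical_influence_radius(target_xy, delivery_xy, v_attacker_max, v_target):
    """R = (v_a^max / v_t) * d^D1: radius of the disk centered on the delivery area."""
    d_d1 = math.dist(target_xy, delivery_xy)
    return (v_attacker_max / v_target) * d_d1

def in_theoretical_influence_area(attacker_xy, target_xy, delivery_xy,
                                  v_attacker_max, v_target):
    """True if the attacker's initial position lies inside the theoretical disk."""
    r = theoretical_influence_radius(target_xy, delivery_xy, v_attacker_max, v_target)
    return math.dist(attacker_xy, delivery_xy) <= r

# Example: target at 400 ft/s, attacker comparable to A_600 at 600 ft/s.
print(in_theoretical_influence_area(attacker_xy=(0.0, 8000.0),
                                    target_xy=(0.0, 10000.0),
                                    delivery_xy=(5000.0, 0.0),
                                    v_attacker_max=600.0, v_target=400.0))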
Figure 8: Pairs of scenarios (I_t, D_2) leading to a poor success rate for agent A_1000. Left: clusters of scenarios where the target's initial positions are too close to the delivery area. Right: clusters of scenarios where the delivery area is in between I_t and D_2.

Impact of the geometry. With a speed much higher than the target's, the agent A_1000 almost systematically succeeds in the interception. We can distinguish two cases where the hijacking does not succeed despite the speed difference. Figure 8 shows examples of these situations. The first group consists of situations where the target is already too close to the delivery area and where the agent cannot interact quickly enough with the target. The second group involves situations where the delivery area is located between the target and the interception zone. It may happen, in these situations, that the policy learned by the attacker does not succeed in deviating the target's trajectory enough to prevent it from reaching the delivery zone. We believe that fine-tuning the reward function would improve the agent's performance in these situations. Since our goal is to demonstrate the feasibility of the attack and not to obtain a better score, we consider the search for the best policy outside the scope of this work.

6. Conclusion
This work highlights a security flaw in automatic collision avoidance systems designed to equip future unmanned aerial vehicles. Through the example of a delivery drone equipped with the ACAS-Xu avoidance system, we demonstrate the possibility of diverting it from its trajectory to an interception zone using an agent trained with Reinforcement Learning. Although we made simplifying assumptions in modeling the problem (a single avoidance system on a horizontal plane), we believe that our method would generalize to other configurations with additional degrees of freedom. Our contribution is not limited to a new hacking method. We believe that it demonstrates the effectiveness of Reinforcement Learning in finding security holes in these autonomous systems, and thus opens the door to future certification processes based on RL adaptive stress testing. Since these subjects are essential for the acceptance and development of future autonomous vehicles, we make our simulator available to the community. In future work, we consider training both agents with Reinforcement Learning in a zero-sum two-player game (as in [21, 22]) to produce collision avoidance policies robust to malicious attacks.

Acknowledgement
This project received funding from the French "Investing for the Future – PIA3" program within the Artificial and Natural Intelligence Toulouse Institute (ANITI). The authors gratefully acknowledge the support of the DEEL project².
² https://www.deel.ai/

References
[1] Code of Federal Regulations CFR 91.221, General operating and flights rules, Subpart C – Equipment, Instrument, and Certificate Requirements, 2016.
[2] Code of Federal Regulations CFR 135.180, Operating requirements: commuter and on demand operations and rules governing persons on board such aircraft; Subpart C – Aircraft and Equipment; 135.180 Traffic Alert and Collision Avoidance System, 1994.
[3] TSO-C119B, Traffic Alert and Collision Avoidance System (TCAS) Airborne Equipment, TCAS II, 1998.
[4] RTCA DO-185A, Minimum Operational Performance Standards for Traffic Alert and Collision Avoidance System II (TCAS II) Airborne Equipment, Volume I, 2013.
[5] H. Teso, Aircraft hacking, 4th annual Hack in the Box (HITB) Security Conference in Amsterdam (2013).
[6] H. Sathaye, D. Schepers, A. Ranganathan, G. Noubir, Wireless attacks on aircraft instrument landing systems, in: 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 357–372.
[7] R. N. Akram, K. Markantonakis, R. Holloway, S. Kariyawasam, S. Ayub, A. Seeam, R. Atkinson, Challenges of security and trust in avionics wireless networks, in: 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC), 2015, pp. 4B1-1–4B1-12. doi:10.1109/DASC.2015.7311416.
[8] R. Santamarta, Arm IDA and Cross Check: Reversing the 787's core network, https://act-on.ioactive.com/acton/attachment/34793/f-cd239504-44e6-42ab-85ce-91087de817d9/1/-/-/-/-/Arm-IDA%20and%20Cross%20Check%3A%20Reversing%20the%20787%27s%20Core%20Network.pdf, 2019.
[9] M. J. Kochenderfer, J. E. Holland, J. P. Chryssanthacopoulos, Next-generation airborne collision avoidance system, Technical Report, Massachusetts Institute of Technology, Lincoln Laboratory, Lexington, United States, 2012.
[10] EUROCAE WG 75.1 / RTCA SC-147, Minimum Operational Performance Standards For Airborne Collision Avoidance System Xu (ACAS Xu), 2020.
[11] F. Netjasov, A. Vidosavljevic, V. Tosic, M. H. Everdij, H. A. Blom, Development, validation and application of stochastically and dynamically coloured Petri net model of ACAS operations for safety assessment purposes, Transportation Research Part C: Emerging Technologies 33 (2013) 167–195.
[12] J.-B. Jeannin, K. Ghorbal, Y. Kouskoulas, R. Gardner, A. Schmidt, E. Zawadzki, A. Platzer, Formal verification of ACAS X, an industrial airborne collision avoidance system, in: 2015 International Conference on Embedded Software (EMSOFT), 2015, pp. 127–136. doi:10.1109/EMSOFT.2015.7318268.
[13] I. Lahsen-Cherif, H. Liu, C. Lamy-Bergot, Real-time drone anti-collision avoidance systems: an edge artificial intelligence application, in: 2022 IEEE Radar Conference (RadarConf22), IEEE, 2022, pp. 1–6.
[14] K. D. Julian, J. Lopez, J. S. Brush, M. P. Owen, M. J. Kochenderfer, Deep neural network compression for aircraft collision avoidance systems, 35th Digital Avionics Systems Conference (DASC) (2016).
[15] W. Xiang, Z. Shao, Approximate bisimulation relations for neural networks and application to assured neural network compression, arXiv preprint arXiv:2202.01214 (2022).
[16] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An efficient SMT solver for verifying deep neural networks, CoRR abs/1702.01135 (2017).
[17] A. Clavière, E. Asselin, C. Garion, C. Pagetti, Safety verification of neural network controlled systems, in: 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), IEEE, 2021, pp. 47–54.
[18] M. Damour, F. D. Grancey, C. Gabreau, A. Gauffriau, J.-B. Ginestet, A. Hervieu, T. Huraux, C. Pagetti, L. Ponsolle, A. Clavière, Towards certification of a reduced footprint ACAS-Xu system: A hybrid ML-based solution, in: International Conference on Computer Safety, Reliability, and Security, Springer, 2021, pp. 34–48.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[20] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature 575 (2019) 350–354.
[21] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489.
[22] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al., A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362 (2018) 1140–1144.
[23] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, et al., Mastering Atari, Go, chess and shogi by planning with a learned model, Nature 588 (2020) 604–609.
[24] R. C. Dorf, R. H. Bishop, Modern control systems, Pearson Prentice Hall, 2008.
[25] S. L. Waslander, G. M. Hoffmann, J. S. Jang, C. J. Tomlin, Multi-agent quadrotor testbed control design: Integral sliding mode vs. reinforcement learning, in: 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2005, pp. 3712–3717.
[26] K. D. Young, V. I. Utkin, U. Ozguner, A control engineer's guide to sliding mode control, IEEE Transactions on Control Systems Technology 7 (1999) 328–342.
[27] J. Hwangbo, I. Sa, R. Siegwart, M. Hutter, Control of a quadrotor with reinforcement learning, IEEE Robotics and Automation Letters 2 (2017) 2096–2103.
[28] W. Koch, R. Mancuso, R. West, A. Bestavros, Reinforcement learning for UAV attitude control, ACM Transactions on Cyber-Physical Systems 3 (2019) 1–21.
[29] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, arXiv preprint arXiv:1509.02971 (2015).
[30] J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, Trust region policy optimization, in: International Conference on Machine Learning, PMLR, 2015, pp. 1889–1897.
[31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
[32] P. Becker-Ehmck, M. Karl, J. Peters, P. van der Smagt, Learning to fly via deep model-based reinforcement learning, arXiv preprint arXiv:2003.08876 (2020).
[33] H. X. Pham, H. M. La, D. Feil-Seifer, L. V. Nguyen, Autonomous UAV navigation using reinforcement learning, arXiv preprint arXiv:1801.05086 (2018).
[34] C. J. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992) 279–292.
[35] G. Tong, N. Jiang, L. Biyue, Z. Xi, W. Ya, D. Wenbo, UAV navigation in high dynamic environments: A deep reinforcement learning approach, Chinese Journal of Aeronautics 34 (2021) 479–489.
[36] B. Vlahov, E. Squires, L. Strickland, C. Pippin, On developing a UAV pursuit-evasion policy using reinforcement learning, in: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE, 2018, pp. 859–864.
[37] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1928–1937.
[38] R. L. Shaw, Fighter Combat: Tactics and Maneuvering, Naval Institute Press, Annapolis, MD, USA (1985).
[39] S. Xuan, L. Ke, UAV swarm attack-defense confrontation based on multi-agent reinforcement learning, in: Advances in Guidance, Navigation and Control, Springer, 2022, pp. 5599–5608.
[40] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. Pieter Abbeel, I. Mordatch, Multi-agent actor-critic for mixed cooperative-competitive environments, Advances in Neural Information Processing Systems 30 (2017) 6379–6390.
[41] M. Min, L. Xiao, D. Xu, L. Huang, M. Peng, Learning-based defense against malicious unmanned aerial vehicles, in: 2018 IEEE 87th Vehicular Technology Conference (VTC Spring), IEEE, 2018, pp. 1–5.
[42] M. Gnanasekera, A. V. Savkin, J. Katupitiya, Range measurements based UAV navigation for intercepting ground targets, in: 2020 6th International Conference on Control, Automation and Robotics (ICCAR), IEEE, 2020, pp. 468–472.
[43] E. Çetin, C. Barrado, E. Pastor, Counter a drone in a complex neighborhood area by deep reinforcement learning, Sensors 20 (2020) 2320.
[44] Y. Cheng, Y. Song, Autonomous decision-making generation of UAV based on soft actor-critic algorithm, in: 2020 39th Chinese Control Conference (CCC), IEEE, 2020, pp. 7350–7355.
[45] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International Conference on Machine Learning, PMLR, 2018, pp. 1861–1870.
[46] R. Lee, M. J. Kochenderfer, O. J. Mengshoel, G. P. Brat, M. P. Owen, Adaptive stress testing of airborne collision avoidance systems, in: 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC), IEEE, 2015, pp. 6C2-1.
[47] R. Coulom, Efficient selectivity and backup operators in Monte-Carlo tree search, in: International Conference on Computers and Games, Springer, 2006, pp. 72–83.
[48] G. Manfredi, Y. Jestin, An introduction to ACAS Xu and the challenges ahead, in: 35th Digital Avionics Systems Conference (DASC'16), 2016, pp. 1–9.
[49] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[50] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, et al., Policy gradient methods for reinforcement learning with function approximation, in: NIPS, volume 99, Citeseer, 1999, pp. 1057–1063.
[51] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, Deterministic policy gradient algorithms, in: International Conference on Machine Learning, PMLR, 2014, pp. 387–395.
[52] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602 (2013).
[53] H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
[54] S. Fujimoto, H. Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: International Conference on Machine Learning, PMLR, 2018, pp. 1587–1596.
[55] A. Raffin, A. Hill, M. Ernestus, A. Gleave, A. Kanervisto, N. Dormann, Stable Baselines3, https://github.com/DLR-RM/stable-baselines3, 2019.