Inroads into Autonomous Network Defence using Explained Reinforcement Learning

Myles Foley¹, Mia Wang¹, Zoe M², Chris Hicks² and Vasilios Mavroudis²
¹ Imperial College London
² The Alan Turing Institute

Abstract
Computer network defence is a complicated task that has necessitated a high degree of human involvement. However, with recent advancements in machine learning, fully autonomous network defence is becoming increasingly plausible. This paper introduces an end-to-end methodology for studying attack strategies, designing defence agents and explaining their operation. First, using state diagrams, we visualise adversarial behaviour to gain insight about potential points of intervention and inform the design of our defensive models. We opt to use a set of deep reinforcement learning agents trained on different parts of the task and organised in a shallow hierarchy. Our evaluation shows that the resulting design achieves a substantial performance improvement compared to prior work. Finally, to better investigate the decision-making process of our agents, we complete our analysis with a feature ablation and importance study.

Keywords
Reinforcement Learning, Autonomous Cyber Defence, Deep Learning, Network Defence

1. Introduction
Computer network security is characterised by an asymmetry: the defender needs to ensure constant protection of the network's components, while the adversary can opportunistically single out weak entry points. Such asymmetries have been identified and addressed in many other areas of cyber security. For example, cryptographic protocols (e.g., TLS) thwart denial of service attacks by ensuring that the prover commits enough computation cycles before the verifier does so. In network defence, however, the problem remains open as the task is complex [1] and involves a wide array of both attack vectors and mitigation tools. Thus, network defence is currently handled primarily by human experts, which entails high operational costs.

RL, and particularly deep RL (DRL), excels in interactive tasks that cannot easily be solved using analytical solutions. Human and even super-human levels of performance have been achieved in a range of complex tasks including classic board games such as chess and Go [2, 3], video games ranging from classic Atari [4, 5] to multi-player real-time strategy games [6], autonomous driving [7], and robotics [8]. Recently, DRL has also been successfully applied to autonomous network defence [9], a highly interactive task where the defender proactively monitors the state of the network, identifies abnormalities, and acts to remediate them. Commonly, this takes the form of a shallow hierarchy of specialised subagents coordinated by a controller, any combination of which may be autonomous. To date, however, there has been limited consideration of the explainability of these models.
Explainable AI has, in domains such as natural language processing and computer vision [10], proven useful not only for end users but also for experts and developers of AI systems. DRL models are particularly challenging to explain because the neural networks which represent their agent policies are not readily understandable by humans. Nonetheless, the ability to explain and understand the actions of an autonomous defensive agent is critical. This work investigates, and answers in the affirmative, whether explainable RL (XRL) models and environments can improve autonomous defensive capabilities and aid in their development.

1.0.1. Contributions
Our main contributions are:
• We develop methodologies for visualising (i.e., explaining) attacker functionality in the CybORG cyber environment. Our methodology highlights previously undocumented differences in the adversary models and motivates two new controller architectures with improved classification accuracy.
• We present the full details of our new controller and specialised subagent models. We then evaluate them against two classes of adversary in the CybORG environment, realising substantial performance improvements.
• We perform a feature ablation and importance study to understand the most influential elements in the observation space and explain our model outputs.

2. RL Background
In this section we discuss the key RL techniques that are relevant for the rest of the paper.

2.1. Deep RL Algorithms
2.1.1. PPO
Proximal Policy Optimisation (PPO) is an efficient policy gradient method [5] for DRL. It has been shown to outperform other popular algorithms such as A3C [2], achieving super-human performance in a variety of complex tasks including 49 separate ATARI arcade games [5]. Despite its effectiveness in very complex environments [11], it has seen only limited use in security settings [12, 13].

PPO uses a policy $\pi_\theta$ ($\theta \in \mathbb{R}$) with an objective function defined by the total discounted reward $J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$. Formulating the objective function in this way allows actor-critic architectures to be used: the actor selects an action which is evaluated by the critic. The policy gradient is then computed as:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, A_{\pi_\theta}(s, a)\right]$   (1)

where $A_{\pi_\theta}(s, a)$ is the advantage of taking action $a$ instead of the average action as computed by the policy $\pi_\theta$, i.e., $A_{\pi_\theta}(s, a) = Q_{\pi_\theta}(s, a) - V_{\pi_\theta}(s)$ [14]. During gradient descent, PPO introduces a clipping function that both prevents destructively large updates, which can trap the policy in poor local optima, and avoids overly small updates that significantly increase the length of training.

2.2. Curious Exploration
Curiosity is a technique that enables agents to explore their environment based on an intrinsic reward signal not provided by the environment [15]. Such a signal is particularly useful in the absence of a continual extrinsic reward (e.g., the running score found in some games). Pathak et al. [15] introduce the Intrinsic Curiosity Module (ICM), a self-supervised technique in which agents choose actions based on the uncertainty in the outcome of each action, intrinsically motivating the exploration of unknown states. ICM also ensures that agents are only incentivised to reach states that are impacted by their actions, avoiding those which are inherently unpredictable.

2.3. Explainable RL
Explainable RL (XRL), a fledgling sub-field of explainable AI, is the study of tools and methods which enhance human understanding of the actions taken by autonomous agents. A recent and thorough review of XRL is provided by Heuillet et al.
[16] and separately by Puiutta and Veith [17]. XRL methods are commonly divided between those which are intrinsic, sometimes called transparent, and those which are post-hoc. Intrinsic XRL models are inherently interpretable and offer explainability at the time of training. In contrast, post-hoc explainability occurs after training, often by creating a second, simpler model to provide explanations. In DRL, learned policies are represented by neural networks, making them difficult to interpret. Post-hoc explainability allows the performance advantages of DRL [3] to be retained whilst facilitating human understanding of autonomous decision making. Explainability is not limited to the users and experts affected by the decisions of models but, as in this work, is also a valuable researcher's aid in developing more efficient and higher-performance models.

3. Network Simulation Environment
We use the CybORG environment [18], which simulates the computer network of a manufacturing plant, as shown in Figure 1. The network consists of five user hosts (Subnet 1), three enterprise servers (Subnet 2¹), three operational hosts and the operational server (Subnet 3). Each host exposes a number of network services that other hosts can connect to, and which may have exploitable vulnerabilities. However, due to the network's firewalls, hosts in Subnet 1 cannot directly connect to machines in Subnet 3, and the operational server is accessible only through the operational hosts. The liveness of the operational server has a direct impact on manufacturing and is considered critical. CybORG assumes two players, a defender and an adversary, who interact with the turn-based environment using the actions available to them.

¹ Subnet 2 also includes the defender's machine.

Figure 1: The CybORG environment showing the three subnets and their corresponding hosts and firewalls.

A common drawback of simulated environments in RL is the reality gap, which causes agents not to generalise sufficiently when moved from the simulation (i.e., training) to reality (i.e., evaluation). This is due to the simulation not adequately matching reality (e.g., in robotics). To address this, CybORG provides a network emulator that runs on Amazon Web Services (AWS). The combination of simulation and emulation ensures that the reality gap is minimised, with the actions available and their effect on the environment consistent across both [18]. The CybORG environment is host to the 'Cyber Autonomy Gym for Experimentation' (CAGE) challenge [19, 20, 21]. CAGE is an international Kaggle-style competition, providing an increasingly challenging benchmark for the evaluation of autonomous defensive agents. The competition is currently in its second iteration (CAGE II).

3.1. Action Space
Attackers and defenders have unique action spaces. Defenders perform actions at the host level: 1) Analysing the processes running, 2) Terminating malicious processes, 3) Restoring the host to a previous (benign) state, and 4) Deploying honeypot² services. Adversaries can: 1) Scan a subnet for hosts, 2) Scan the ports of a host, 3) Exploit a service on a port, 4) Escalate their access, and 5) Disrupt the services on the operational server. Both players have a 'sleep' action to perform no action on the network. Based on the selected actions, the environment updates its state and the agents' scores.
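To make the turn-based interaction concrete, the sketch below shows a schematic defender loop over these actions; the environment wrapper, the action names and the step() signature are illustrative assumptions rather than CybORG's actual interface.

```python
# Schematic, gym-style interaction loop for the defender.
# The env/agent interfaces, the action names and the step() return values are
# illustrative assumptions, not CybORG's actual API.
from enum import Enum

class DefenderAction(Enum):
    SLEEP = 0     # perform no action this turn
    ANALYSE = 1   # analyse the processes running on a host
    REMOVE = 2    # terminate malicious processes
    RESTORE = 3   # restore the host to a previous (benign) state
    DECOY = 4     # deploy a honeypot service

def run_episode(env, defender, max_steps=100):
    """Play one episode and return the cumulative (penalty) reward."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = defender.choose(obs)          # e.g. (DefenderAction.RESTORE, host_id)
        obs, reward, done, info = env.step(action)
        total_reward += reward                 # rewards are penalties, i.e. <= 0
        if done:
            break
    return total_reward
```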
It should be noted that even valid actions may not succeed, as the CybORG simulator introduces randomness to mimic the behaviour of the emulator (e.g., a valid node restoration may occasionally fail).

3.2. Observation Space
The defender's observation space is a vector of 52 bits, i.e., 4 bits for every network host. The first two bits represent whether the host state is unknown (none), scanned or exploited (set when a decoy is triggered); the last two bits specify the access the attacker has on the host machine (i.e., none, user or administrator).

² Honeypot refers to a decoy system or service that lures attackers by appearing to suffer from known security vulnerabilities. Honeypots are used to detect malicious actors and study their behaviour.

As in a real network defence situation, neither the defender nor the adversary is omniscient. Neither agent knows the state of the network or the other's position with absolute certainty. In addition, the outcomes of actions are stochastic. For example, from the defender's perspective, when an exploit fails it is not possible to precisely determine which exploit was attempted. This can be crucial information in the instance that an adversary favours a specific exploit strategy: a better informed defender could strategically place decoys on the targeted service to frustrate and evade further attempts more effectively.

3.3. Reward Function
Most games include a scoring function that quantifies the performance of the player. Similarly, CybORG uses a reward function that rewards the adversary and penalises the defender for every compromised or impacted network host. The reward function is as follows: on each turn, for every host on which the adversary has admin access, the defender receives a reward of -0.1, and for every server the reward is -1. There is a -10 reward for disruption of the operational server and a -1 reward when any device is 'restored'. In the context of RL, the negative reward for the defensive agent incentivises the agent to take actions that minimise the effect of the adversary.

3.4. Adversaries
The environment includes two adversaries: the BLineAgent, which has prior knowledge (i.e., full knowledge of the network's structure but not its current state), and the MeanderAgent, which does not have any prior information. Both agents share the same objective: to reach the operational server and, after escalating their privileges, disrupt its services (i.e., impact its liveness). Due to its prior knowledge, the BLineAgent follows an optimal exploitation trajectory to the operational server. In contrast, the MeanderAgent scans the network breadth-wise for vulnerable hosts and gradually traverses the subnets. To prevent trivial defence strategies, the adversary is given user access on a predetermined host (in Subnet 1) that cannot be 'restored' to a benign state by the defender.

Figure 2: Hierarchical structure of the overall defensive model including the specialist subagents.

4. Model
The models that we train have a similar basic structure to those described in [9], which were trained for CAGE I. In particular, we focus our efforts on training a hierarchy of specialised defensive agents using DRL. These agents feature a controller agent that, at each time step, chooses a subagent to perform the action. Each subagent is trained against a specific adversarial strategy. As described in Section 3.4, the environment includes two adversaries.
The hierarchical architecture was developed specifically to exploit this. The model supports two expert subagents that, through the controller, are 'consulted' over the course of an episode (Figure 2). This avoids the performance limitations of a single, more general agent. Given the differences in the two adversaries, each subagent requires a different neural architecture for best performance. These are described below.

4.1. MeanderAgent Defence
Our MeanderAgent defensive subagent was trained using the PPO algorithm and utilises a comparatively deeper neural network comprising three hidden layers with widths 256, 256, and 52. Full details of the hyperparameters used can be found in Appendix A. Notably, curiosity did not improve the performance. Since the MeanderAgent is explicitly designed to explore the network during its attack, the opposing defender is also forced to explore more broadly and to employ a wider range of strategies during training. As such, it learns sufficiently general strategies without the need for curiosity.

4.2. BLineAgent Defence
In contrast to the MeanderAgent, the BLineAgent follows a near-optimal path through the network. The BLineAgent defence, therefore, is at much greater risk of overfitting during training. As a result, we found that when training defensive agents against the BLineAgent, it was beneficial to include the curiosity mechanism. In this paper we consider two subagents for BLineAgent defence: an Action Knowledge (AK) subagent and a State Representation (SR) subagent. Both are trained using PPO with curiosity but make different modifications to the state space. The AK subagent modifies each observation by appending a single bit indicating the success of the previous action. We find that this gives the subagent a better understanding of the defensive process and results in an improvement in performance. Secondly, the SR agent is identical to the AK subagent, but receives observations of 27 floats as opposed to 53 bits. In this state space, each host has two floats to represent the features of activity and compromise. The additional float indicates whether the previous action succeeded. Although the mean episode reward is comparable to the AK agent's mean reward, we see a notable decrease in variance.

Figure 3: The action-outcome transition graphs of (a) the MeanderAgent and (b) the BLineAgent adversary in steps 1-4 of the CAGE II CybORG environment.

5. Explaining the Adversary Model
The behaviour of the adversaries is dependent on the network topology and the choice of defensive actions. In addition, there is stochasticity in both the choice and outcome of actions across all of these components. Explaining adversarial behaviour proved essential in developing effective defensive models. To better understand each adversary, we record, at each time step, the chosen action, its outcome and the resulting state transition. For consistency across multiple episodes we resolve IP ranges and addresses to subnets and hostnames, respectively. We observe that the connectivity (i.e., the edges) of the resulting graph provides a clear signal for differentiating the two adversaries. Figure 3 shows a subset of the observations, recorded during the first four steps of adversarial behaviour, in which the BLineAgent and MeanderAgent can be seen adopting a depth-first and breadth-first approach to attacking the network, respectively.
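As a minimal sketch of this procedure (the episode-log format and helper names are our own assumptions), the recorded transitions can be aggregated into a directed graph whose edge set is then compared across adversaries:

```python
# Minimal sketch of building an action-outcome transition graph from episode
# logs. The record format is an illustrative assumption; IPs are assumed to be
# already resolved to hostnames and subnets for consistency across episodes.
from collections import Counter

def build_transition_graph(episodes):
    """Count (state, action, outcome, next_state) transitions over many episodes.

    Each step record is assumed to look like:
    {"state": "User0", "action": "ExploitRemoteService",
     "outcome": "success", "next_state": "User0:user"}
    """
    edges = Counter()
    for episode in episodes:
        for step in episode:
            edges[(step["state"], step["action"],
                   step["outcome"], step["next_state"])] += 1
    return edges

def distinguishing_edges(graph_a, graph_b):
    """Edges observed under one adversary but not the other; their presence in
    the first few timesteps is the signal used to tell the adversaries apart."""
    return set(graph_a) ^ set(graph_b)
```

Aggregated edges of this form can be rendered with a standard graph library to produce diagrams like those in Figure 3 and Appendix B.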
In Section 6 we present two methods which make use of this observation to more accurately determine the class of adversarial threat than in prior work [9]. In Appendix B we include the fully extracted adversary specifications generated by our methodology.

6. Hierarchical RL Architecture
To improve the performance of our defensive capability we explore the use of alternative controller models. We introduce two new types of controller for this task, one heuristic and another bandit-based.

6.1. Bandit Controller Model
We employ a bandit controller that is based on the multi-armed bandit architecture. The task is to determine which of the adversaries is currently attacking the network, based on the sequence of observations. However, using a bandit or bandit-like approach comes with several challenges in this setting. In the traditional multi-armed bandit there is no notion of state: an agent takes actions and then observes the reward. However, in the CybORG environment a single observation cannot be used to determine the current adversary. Thus sequences of observations need to be considered and, due to the stochasticity, there are multiple sequences that can be observed over a given number of timesteps. A single bandit predicting the adversary will do no better than 50%.

This is analogous to the traditional multi-armed bandit setting. Consider the task of determining which of two slot machines has the higher payout in a casino (A): the task is trivial after several attempts. Now consider a second, identical casino (B) where the payout of the machines is flipped. Again, we can find the better machine in B after some error. Finally, consider being randomly placed in A or B and having only one attempt to select the slot machine with the highest payout. As we do not know which casino we are in (as everything is identical), the best possible guess rate is 50%. We solve this problem by separating the observation (which casino you are in) from the bandit itself: we define $N_b$ bandits, one for each of the observations, so that each observation is unique to the bandit predicting the adversary. While this could also be solved by a logistic regression model, the Bandit Controller is able to learn with fewer samples and can additionally detect new adversary behaviours and learn to predict them in an online fashion.

6.1.1. Bandit Controller Implementation
The bandit learning algorithm, shown in Algorithm 1, allows the bandit controller to track the states that it has previously seen, creating a new bandit for each newly seen state. Each of these bandits is initialised with $Q$ values for each of the actions $a \in \{0, 1, 2\}$, corresponding to the MeanderAgent, the BLineAgent, and no adversary. The $Q$ values are updated using the reward $R$ and the number of times that prediction has been selected, $N(A)$. We train the bandit controller for 15,000 timesteps, using $\epsilon = 0.01$.

The Bandit Controller has a state different to that of its subagents: its state is a sliding window of the last four timesteps from the CybORG environment. As we can see from Figure 3b, the minimum number of actions before an adversary has user privilege (and the first unambiguous instance of malicious behaviour) is three. A defensive agent can observe this on the fourth timestep, hence a prediction from the bandit controller only needs to happen once per episode. Finally, we use a simple reward function of +1 for a correct prediction and -1 for an incorrect prediction.
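The following is a minimal Python sketch of the per-observation ε-greedy bandit described above and formalised in Algorithm 1; the observation encoding (a hashable window of the last four timesteps) and the class interface are assumptions for illustration.

```python
# Minimal sketch of the per-observation epsilon-greedy bandit controller
# (cf. Algorithm 1). Labels 0, 1, 2 correspond to MeanderAgent, BLineAgent
# and no adversary; the observation window is assumed to be hashable.
import random

class BanditController:
    def __init__(self, n_labels=3, epsilon=0.01):
        self.n_labels = n_labels
        self.epsilon = epsilon
        self.bandits = {}  # observation -> {"Q": [...], "N": [...]}

    def _bandit_for(self, obs):
        # Create a fresh bandit the first time an observation window is seen.
        key = tuple(obs)
        if key not in self.bandits:
            self.bandits[key] = {"Q": [0.0] * self.n_labels,
                                 "N": [0] * self.n_labels}
        return self.bandits[key]

    def predict(self, obs):
        bandit = self._bandit_for(obs)
        if random.random() < self.epsilon:                                # explore
            return random.randrange(self.n_labels)
        return max(range(self.n_labels), key=lambda a: bandit["Q"][a])    # exploit

    def update(self, obs, label, reward):
        # Incremental update: Q(A) <- Q(A) + (1 / N(A)) * (R - Q(A))
        bandit = self._bandit_for(obs)
        bandit["N"][label] += 1
        bandit["Q"][label] += (reward - bandit["Q"][label]) / bandit["N"][label]
```

During training, predict() would be called once per episode on the fourth-timestep window, and update() would receive +1 or -1 depending on whether the prediction matched the sampled adversary.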
6.2. Heuristic Controller Model
We also construct a heuristic for predicting the adversary. This approach is possible as we are able to observe the patterns that the adversaries display in a controlled version of the CybORG environment. As we can see in Figures 3b and 3a, the BLineAgent and MeanderAgent have fundamentally different strategies in the first four moves they make. Using this privileged view of the adversarial behaviour allows for a manual and formal definition of the behaviour, as given in Heuristic 1. As in the Bandit Controller, we use this heuristic once per episode, on the fourth timestep, to determine which adversary is attacking the network.

Algorithm 1: Bandit Controller Learning Algorithm.
  Initialise the set of known states, s_n
  Initialise the set of bandits, B
  for a = 1 to k:
      bandit_0.Q(a) ← 0    // initialise Q values and action counter for the first bandit
      bandit_0.N(a) ← 0
  Predict(s):
      if s ∉ s_n:
          s_n ← s_n ∪ {s}
          Initialise bandit_s; B ← B ∪ {bandit_s}
      A ← argmax_a bandit_s.Q(a)  with probability 1 − ε;  a random action  with probability ε
      R ← prediction_result(A)
      bandit_s.N(A) ← bandit_s.N(A) + 1
      bandit_s.Q(A) ← bandit_s.Q(A) + (1 / bandit_s.N(A)) · (R − bandit_s.Q(A))

Heuristic 1. The scanning of two different hosts on the network within the first four timesteps indicates the presence of the MeanderAgent adversary. Otherwise, this is either the BLineAgent adversary or the User agent.

7. Evaluation
In this section we evaluate the performance of our specialist subagents against the two adversaries. We further investigate the performance of the controller models. Finally, we evaluate the full defensive model capable of defending against either adversary. We use the model described in prior work [9] as a baseline performance measure (baseline for brevity), as this has been established as state-of-the-art and achieved the best score in CAGE I. Because the scoring function assigns only penalty points (i.e., 0 is the theoretical maximum score), all the reported rewards are negative.

7.1. Specialised Subagents
7.1.1. Training Results
Figure 4 shows the average reward of each defensive subagent as trained against the BLineAgent (left column) and the MeanderAgent (right column). The AK and SR methods achieve peak rewards against the BLineAgent of -12.227 and -11.465 respectively, both of which are an improvement over the baseline [9] PPO with curiosity-based model, which achieves -13.475.
Furthermore, removing curiosity negatively impacts the reward against the BLineAgent, as shown clearly in the max reward plot of Figure 4(c).

Figure 4: Mean, maximum and minimum reward of blue subagents against the BLineAgent (left) and the MeanderAgent (right) over 10 million timesteps.

Training Adversary   Defensive Model   vs BLineAgent       vs MeanderAgent     Mean
                                       (mean ± std)        (mean ± std)        (mean ± std)
MeanderAgent         PPO               -24.91 ± 9.21       -17.71 ± 5.06       -21.31 ± 7.43
                     baseline          -123.91 ± 229.59    -30.39 ± 47.74      -77.15 ± 165.82
BLineAgent           AK                -14.43 ± 30.28      -145.26 ± 123.92    -79.84 ± 90.20
                     SR                -12.95 ± 6.19       -269.64 ± 235.24    -139.29 ± 166.40
                     baseline          -16.80 ± 21.12      -201.13 ± 143.40    -108.96 ± 102.49

Table 1: The performance of the defensive subagents against their corresponding adversaries. Evaluated on 1,000 episodes of 100 timesteps each.

The difference in mean reward is explained by the maximum and minimum rewards. All models apart from PPO experience a first plateau in maximum reward of -9 and then step up to a second plateau of around -1. The SR agent finds the optimal policy earlier than the AK agent during training. In addition, the minimum rewards of the baseline and PPO models have greater variance than those of the SR and AK agents, and the AK agent has a marginally higher probability of scoring very poorly (i.e., below -300). Earlier convergence to the optimal policy and smaller policy variability make the SR agent the best model against the BLineAgent. This corroborates the standard deviation graph and the 1,000-episode evaluation results in Table 1, where the SR agent displays a less negative reward and a standard deviation that is only a fifth of the AK agent's. Against the MeanderAgent attacker, the PPO and SR agents outperform the baseline (-24.384) with best mean rewards of -17.065 and -19.959 during training. Figure 4 shows the advantage of using a PPO three-layer architecture, which results in higher min and max rewards with reduced variance.

7.1.2. Specialist Agents
Here we evaluate the performance of our defensive subagents against their separate adversaries. We select the best performing agents from training for evaluation: the PPO defence for the RedMeander and both the AK and SR defences for the BLineAgent.
We evaluate each for 1,000 episodes of 100 steps and summarise our results in Table 1. For completeness, we also cross-evaluate our agents against the adversary not seen during training. Against the RedMeander adversary, the PPO defence outperforms the baseline against both adversaries, resulting in a mean score of -21.3 (an improvement by a factor of 3.6) and a reduction in standard deviation by a factor of more than 9. This highlights the advantage of the increased depth of the neural network over the baseline. Against the BLineAgent adversary, we see that the SR agent is able to achieve a 1.5 times greater reward, with a 4.89 times lower standard deviation. However, this comes at the cost of generality. A trend in all of the subagents is that when defending against previously unseen adversaries, the performance is significantly diminished.

Controller Agent                  Prediction Accuracy
                                  BLineAgent    RedMeander
PPO with curiosity (4 steps)      76.8%         0.0%
PPO with curiosity (100 steps)    30.3%         42.9%
Heuristic                         100.0%        100.0%
Bandit                            100.0%        100.0%

Table 2: Controller performance taken over 1,000 episodes of 4 steps, except in the case of PPO with curiosity (100 steps), which predicts the adversary at each timestep.

7.2. Controller Models
As seen in Section 7.1.2, the defensive subagents do not generalise well beyond the adversaries that they are trained against. To address this, Sections 6.1 and 6.2 introduce two new controller architectures: Bandit and Heuristic. Here we evaluate the ability to correctly predict the adversary within the first four timesteps of an episode (as our controllers predict the adversary on the fourth timestep). For each episode, we randomly sampled one of the two red adversaries (i.e., 50% probability of selecting the BLineAgent). Table 2 shows that the baseline model has a strong bias towards selecting the BLineAgent. To investigate further, we let the baseline agent make predictions on each timestep until the end of the episode (cf. guessing only once, after the 4th timestep). As seen in Table 2, the repeated guesses significantly reduced bias but accuracy remained low. In contrast, neither our bandit nor our heuristic controller exhibits this bias, and both perfectly predict the correct attacker type.

7.3. Hierarchical Defensive Model
Here we evaluate the complete defensive model. Table 3 reports the mean and standard deviation for the 'best pair' combinations of subagents as determined by our evaluation in Section 7.1 (i.e., PPO for the MeanderAgent, and AK or SR for the BLineAgent). We observe that the subagents play a significant role in the improvement over the baseline. Over episodes of 100 timesteps, we are able to improve the result by at least 30% for the BLineAgent and 170% for the MeanderAgent. The lowest reward values are split evenly between the Heuristic and Bandit controllers. These models outperform the PPO controller models, regardless of the subagents, in four of the six combinations of adversary and episode length. When using the Bandit or Heuristic controller, performance against the MeanderAgent is improved by 11.7%, a more significant gain than against the BLineAgent (only improved by 1%). Table 1 indicates that models trained against the BLineAgent perform poorly against the MeanderAgent. This can be explained by the fact that the BLineAgent has more information about the network, so its behaviour is more predictable. In contrast, the MeanderAgent's actions have more randomness.
Controller         Subagents           30 steps                     50 steps                      100 steps
                                       BLineAgent   MeanderAgent    BLineAgent    MeanderAgent    BLineAgent     MeanderAgent
Bandit             PPO + AK            -3.56±2.03   -6.80±1.40      -6.79±13.00   -10.10±2.30     -13.54±15.95   -17.30±4.27
                   PPO + SR            -3.62±2.04   -6.88±1.42      -6.26±3.18    -10.06±2.15     -13.00±6.28    -17.56±4.51
Heuristic          PPO + AK            -3.56±2.04   -6.80±1.40      -6.79±13.00   -9.96±2.33      -14.07±27.73   -17.57±4.82
                   PPO + SR            -3.71±2.09   -6.86±1.48      -6.17±3.40    -10.04±2.32     -13.06±6.14    -17.32±4.35
Baseline           PPO + AK            -4.35±2.42   -7.19±1.69      -7.45±4.27    -10.84±2.62     -14.97±8.09    -19.33±5.38
(PPO Controller)   PPO + SR            -3.95±2.18   -7.36±1.74      -6.38±3.20    -11.33±3.00     -13.14±6.45    -21.21±6.10
Baseline           Baseline            -4.82±4.22   -8.78±3.21      -9.20±16.01   -19.00±20.86    -18.49±34.40   -47.60±88.16
(PPO Controller)   (PPO subagents)

Table 3: Performance of all subagent-controller combinations, evaluated over 1,000 episodes with a length of 30, 50 and 100 steps each.

8. Explaining the Defensive Models
It is critically important that human operators can understand the decisions made by autonomous agents. Using post-hoc XRL techniques, we determine whether our defensive agents are truly defending the network as their primary objective or as a side effect of an unintended objective. This is common in RL, where agents may exploit improperly specified reward mechanics to maximise their score in unintended ways.

8.1. Ablation Study
To understand which of the features in the observation space influence the agents' decision making, we perform an ablation study over knowledge of: 1) the success or failure of the previous action (henceforth referred to as previous action), 2) the adversary's access on a host (henceforth referred to as adversary access), and 3) whether an adversary has scanned a host (henceforth referred to as adversary scan). The ablation results in Figure 5 show the AK and SR agents against the BLineAgent in 5a and 5b, and the PPO agent against the MeanderAgent³ in 5c. Figure 5a indicates that the AK agent's performance is greatly affected by 'adversary access'. While comparatively little impact seems to derive from the ablation of 'adversary scan' and 'previous action', there is some variance and the rewards fall to -812 and -539, respectively. Interestingly, the SR defensive agent is greatly affected by the ablation of both 'adversary access' and 'adversary scan', with the distribution of rewards being more negative in both cases. This is especially apparent in the case of 'adversary scan'. Previous action has less of an effect in both AK and SR, yet still reduces the mean reward to -30.42 (a factor of 2) and -40.6 (a factor of 3), respectively. However, AK has some outlier scores that result in a minimum reward of -987.8. For PPO against the MeanderAgent, Figure 5c shows that ablation of 'adversary access' causes a drastic reduction in reward, bringing the mean value to -781.23. Ablation of 'adversary scan' reduces the mean reward to -44.70, a factor of 2.52 more negative than when the observation is included.

³ This defensive agent does not use the previous action; however, we include it for completeness.

Figure 5: Ablation results for the three feature types on the reward of the (a) AK BLineAgent defence, (b) SR BLineAgent defence and (c) PPO MeanderAgent defence.
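A minimal sketch of this ablation procedure is shown below; the observation index layout and the agent/environment interfaces are simplified assumptions based on the observation space described in Section 3.2.

```python
# Simplified sketch of the feature ablation: zero out the observation bits of
# one feature group before they reach the agent, then re-measure mean reward.
# The index layout and the agent/env interfaces are illustrative assumptions.
import numpy as np

NUM_HOSTS = 13  # 52 state bits plus one previous-action bit (AK-style observation)

FEATURE_GROUPS = {
    "adversary_scan":   [4 * h + i for h in range(NUM_HOSTS) for i in (0, 1)],
    "adversary_access": [4 * h + i for h in range(NUM_HOSTS) for i in (2, 3)],
    "previous_action":  [4 * NUM_HOSTS],
}

def evaluate_with_ablation(agent, env, feature, episodes=1000, steps=100):
    """Mean episode reward when one feature group is masked in every observation."""
    masked = FEATURE_GROUPS[feature]
    rewards = []
    for _ in range(episodes):
        obs, total = env.reset(), 0.0
        for _ in range(steps):
            obs = np.asarray(obs, dtype=np.float32).copy()
            obs[masked] = 0.0                         # ablate the feature group
            obs, r, done, _ = env.step(agent.choose(obs))
            total += r
            if done:
                break
        rewards.append(total)
    return float(np.mean(rewards))
```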
8.2. Feature Importance
To further validate the importance of 'adversary access', 'previous action', and 'adversary scan', we utilise a well-known framework from explainable AI called SHapley Additive exPlanations (SHAP). This uses an implementation-agnostic, game-theoretic approach to explain the importance of features in determining outputs. SHAP is able to connect optimal credit allocations with local explanations to determine Shapley values. These values provide a way of accurately distributing the contribution of the individual features within the complete feature space [22]. Figure 6 shows the Shapley values for the trained AK and SR subagents against the BLineAgent in 6a and 6b, and PPO against the MeanderAgent in 6c. Each point on these plots is a feature in a specific observation, with the colour representing the value of that feature. All defensive agents observe the same trend in feature importance regardless of their training adversary: 'adversary access' is the most important, followed by 'adversary scan', a trend that is also observed in Figure 5. Note that the PPO RedMeander defensive model does not use 'previous action' and hence it is not included in Figure 6c.

We show that 'adversary access' is an important part of the observation. This indicates that the defensive agents are aware that they need to remove the attackers from hosts. The importance is also seen in Figure 5, as the most significant shifts in reward distribution occur when ablating 'adversary access'. In addition, 'adversary scan' is of importance to the agents, which is clear in Figure 5b as the defensive agent's performance is significantly impacted in the absence of this information. This correlates with Figure 6b, as 'adversary scan' has the greatest spread of any of the SHAP values for the BLineAgent defensive agents. While knowledge of the 'previous action' has the lowest feature importance for the agents, we argue that this is still important for these defensive agents, which, with this knowledge, outperform the baseline and PPO-only models in Figure 4. For example, take the case where a defensive agent acts to remove an adversary from a host: if this action fails, then the defensive agent will have to adjust its strategy. The importance of this feature can further be seen in Figures 5a and 5b, as ablation of this feature has a non-trivial impact on the performance of the agents.

Figure 6: The Shapley values for the three feature types on the reward of the (a) AK BLineAgent defence, (b) SR BLineAgent defence and (c) PPO MeanderAgent defence, from most to least important features.
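To complement this analysis, the sketch below illustrates one way such attributions can be computed with the SHAP library; the placeholder policy function, the data shapes and the choice of SHAP's model-agnostic KernelExplainer are assumptions rather than the exact configuration used in our study.

```python
# Illustrative sketch of computing SHAP values for a trained defensive policy.
# `policy_value` is a placeholder for the subagent's scalar output; replace it
# with the real policy network. Data shapes and explainer choice are assumptions.
import numpy as np
import shap  # SHapley Additive exPlanations [22]

NUM_HOSTS = 13
FEATURE_GROUPS = {  # same assumed index layout as in the ablation sketch above
    "adversary_scan":   [4 * h + i for h in range(NUM_HOSTS) for i in (0, 1)],
    "adversary_access": [4 * h + i for h in range(NUM_HOSTS) for i in (2, 3)],
    "previous_action":  [4 * NUM_HOSTS],
}

def policy_value(observations):
    """Placeholder: map a batch of 53-bit observations to a scalar score."""
    observations = np.atleast_2d(observations)
    return np.zeros(observations.shape[0])  # substitute the trained subagent here

background = np.zeros((20, 53))                             # reference observations
samples = np.random.randint(0, 2, (50, 53)).astype(float)   # observations to explain

explainer = shap.KernelExplainer(policy_value, background)
shap_values = np.asarray(explainer.shap_values(samples))
if shap_values.ndim == 3:        # some shap versions wrap single outputs in a list
    shap_values = shap_values[0]

# Aggregate per-bit attributions into the three feature groups of Figure 6.
for name, idx in FEATURE_GROUPS.items():
    print(name, float(np.abs(shap_values[:, idx]).mean()))
```

Averaging the absolute attributions within each group gives a per-group importance comparable to the ordering reported in Figure 6.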
9. Related Work
The effectiveness of RL across a range of simulated and abstracted autonomous network defence scenarios is well established in the literature. Han et al. [23] show the feasibility and resilience of RL agents under causative attacks in software-defined networks. Elderman et al. [24] model network defence using the framework of a Markov game with incomplete information, highlighting the capabilities of even traditional RL methods (i.e., not DRL) in interactions between a network attacker and defender. The hierarchical approach we build upon was first proposed by Foley et al. [9]. In comparison, we propose two improved controller models based on a deeper understanding of the adversary models. We also develop improved subagents and provide an explainability analysis to understand what causes the agents to defend networks effectively. Other approaches to autonomous network defence include dynamic causal Bayesian optimisation [25], as shown by Andrew et al. [26].

Several alternative network defence simulation environments have been proposed in the literature. Molina-Markham et al. [27] propose FARLAND which, similarly to CybORG, provides a hybrid simulation- and emulation-based environment capable, owing to a rich feature space, of developing agents that can defend real-world networks. Microsoft have an experimental research platform, CyberBattleSim [28], that offers, at a high level of abstraction, a simulation-only network defence environment based on post-breach lateral adversary movement and system exploitation. In contrast to CybORG, CyberBattleSim places greater emphasis on credential access and data collection, such as simulating a GitHub project leaking credentials in the commit history. Another simulation-only environment, developed by Andrew et al. [26], is Yawning Titan (YT). Of all the network defence environments, YT offers the greatest abstraction and omits the majority of individual host details (e.g., operating system processes, network ports) needed for emulation.

RL has also been applied to several closely related problems. In penetration testing (i.e., exploitation, which is a subset of the CybORG environment), Yang and Liu [29] formulate automated penetration testing in the multi-objective RL framework and demonstrate superior performance. Independently, Tran et al. [30] explore hierarchical RL architectures for the same task based on their findings that decomposing large action spaces into smaller sets produces better-performing agents. In intrusion prevention, Hammar and Stadler [31] demonstrate that RL is capable of intrusion prevention when formulated as a multiple stopping problem. Feng and Xu [32] train a defender to protect a single device from an unknown attacker and, finally, Tahsini et al. [33] use a single defender model to protect a water tank system from adversarial attacks.

10. Conclusion
Taking advantage of the rapidly increasing capabilities of neural networks and the advancements in RL algorithms, we present an improved approach to autonomous network defence. Beyond high performance, we place emphasis on the steps before and after training the model. Before training, we use a methodology to observe the adversary behaviour and inform choices in our hierarchical model. Specifically, we introduce two controller architectures, one heuristic and another bandit-based, that improve accuracy when predicting adversaries. Additionally, we develop enhanced subagent architectures optimised for the specific classes of adversary.
After training, our post-hoc analysis includes a feature importance and ablation study for each specialised subagent within the complete hierarchical model. Our results shed light on each agent's decision-making process and help to better understand the system as a whole. This work contributes to a less studied but equally important research direction for future work in autonomous network defence.

Acknowledgements
The authors would like to acknowledge that this research was partially funded by EPSRC grant EP/T51780X/1.

References
[1] P. Speicher, M. Steinmetz, J. Hoffmann, M. Backes, R. Kunnemann, Towards automated network mitigation analysis, in: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC '19, 2019.
[2] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1928–1937.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature (2015).
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with Deep Reinforcement Learning, arXiv:1312.5602 [cs] (2013).
[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal Policy Optimization Algorithms, arXiv:1707.06347 [cs], 2017.
[6] OpenAI et al., Dota 2 with Large Scale Deep Reinforcement Learning, 2019.
[7] A. E. Sallab, M. Abdou, E. Perot, S. Yogamani, Deep Reinforcement Learning framework for Autonomous Driving, Electronic Imaging 29 (2017) 70–76. URL: http://arxiv.org/abs/1704.02532. doi:10.2352/ISSN.2470-1173.2017.19.AVM-023.
[8] J. Kober, J. A. Bagnell, J. Peters, Reinforcement learning in robotics: A survey, International Journal of Robotics Research 32 (2013) 1238–1274. URL: https://doi.org/10.1177/0278364913495721. doi:10.1177/0278364913495721.
[9] M. Foley, C. Hicks, K. Highnam, V. Mavroudis, Autonomous Network Defence using Reinforcement Learning, in: Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, ASIA CCS '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1252–1254. doi:10.1145/3488932.3527286.
[10] J. R. Williford, B. B. May, J. Byrne, Explainable Face Recognition, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision - ECCV 2020, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 248–263.
[11] C. Yu, A. Velu, E. Vinitsky, Y. Wang, A. Bayen, Y. Wu, The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games, 2022. URL: http://arxiv.org/abs/2103.01955. doi:10.48550/arXiv.2103.01955.
[12] T. T. Nguyen, V. J. Reddi, Deep Reinforcement Learning for Cyber Security, 2021.
[13] X. Wu, W. Guo, H. Wei, X. Xing, Adversarial Policy Training against Deep Reinforcement Learning, (2021) 1883–1900. URL: https://www.usenix.org/conference/usenixsecurity21/presentation/wu-xian.
[14] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, Adaptive Computation and Machine Learning series, 2nd ed., 2018.
[15] D. Pathak, P. Agrawal, A. A. Efros, T. Darrell, Curiosity-driven exploration by self-supervised prediction, in: Proceedings of the 34th International Conference on Machine Learning, ICML'17, 2017.
[16] A. Heuillet, F. Couthouis, N. Díaz-Rodríguez, Explainability in deep reinforcement learning, Knowledge-Based Systems (2021). URL: https://www.sciencedirect.com/science/article/pii/S0950705120308145.
[17] E. Puiutta, E. Veith, Explainable Reinforcement Learning: A Survey, in: 4th International Cross-Domain Conference for Machine Learning and Knowledge Extraction (CD-MAKE), 2020. URL: https://hal.inria.fr/hal-03414722.
[18] M. Standen, M. Lucas, D. Bowman, T. J. Richer, J. Kim, D. Marriott, CybORG: A gym for the development of autonomous cyber agents, in: IJCAI-21 1st International Workshop on Adaptive Cyber Defense, 2021.
[19] M. Standen, D. Bowman, S. Hoang, T. Richer, M. Lucas, R. Van Tassel, Cyber Autonomy Gym for Experimentation Challenge 1, https://github.com/cage-challenge/cage-challenge-1, 2021.
[20] M. Standen, D. Bowman, S. Hoang, T. Richer, M. Lucas, R. Van Tassel, P. Vu, M. Kiely, Cyber Autonomy Gym for Experimentation Challenge 2, https://github.com/cage-challenge/cage-challenge-2, 2022. Created by Maxwell Standen, David Bowman, Son Hoang, Toby Richer, Martin Lucas, Richard Van Tassel, Phillip Vu, Mitchell Kiely.
[21] CAGE, CAGE Challenge 1, in: IJCAI-21 1st International Workshop on Adaptive Cyber Defense, 2021.
[22] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., Red Hook, NY, USA, 2017, pp. 4768–4777.
[23] Y. Han, B. I. P. Rubinstein, T. Abraham, T. Alpcan, O. De Vel, S. Erfani, D. Hubczenko, C. Leckie, P. Montague, Reinforcement Learning for Autonomous Defence in Software-Defined Networking, arXiv:1808.05770 [cs, stat] (2018). URL: http://arxiv.org/abs/1808.05770.
[24] R. Elderman, L. J. J. Pater, A. S. Thie, M. M. Drugan, M. A. Wiering, Adversarial Reinforcement Learning in a Cyber Security Simulation, in: Proceedings of the 9th International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, 2017. doi:10.5220/0006197105590566.
[25] V. Aglietti, N. Dhir, J. González, T. Damoulas, Dynamic Causal Bayesian Optimization, in: M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, J. Wortman Vaughan (Eds.), Advances in Neural Information Processing Systems, volume 34, 2021.
[26] A. Andrew, S. Spillard, J. Collyer, N. Dhir, Developing Optimal Causal Cyber-Defence Agents via Cyber Security Simulation, in: Workshop on Machine Learning for Cybersecurity (ML4Cyber) as part of the Proceedings of the 39th International Conference on Machine Learning, 2022.
[27] A. Molina-Markham, C. Miniter, B. Powell, A. Ridley, Network Environment Design for Autonomous Cyberdefense (2021). URL: https://arxiv.org/abs/2103.07583.
[28] J. Bono, W. Blum, CyberBattleSim, https://github.com/microsoft/CyberBattleSim, 2021.
[29] Y. Yang, X. Liu, Behaviour-Diverse Automatic Penetration Testing: A Curiosity-Driven Multi-Objective Deep Reinforcement Learning Approach, 2022. URL: https://arxiv.org/abs/2202.10630. doi:10.48550/ARXIV.2202.10630.
[30] K. Tran, A. Akella, M. Standen, J. Kim, D. Bowman, T. Richer, C. Lin, Deep hierarchical reinforcement agents for automated penetration testing (2021). URL: https://arxiv.org/abs/2109.06449.
[31] K. Hammar, R. Stadler, Learning Intrusion Prevention Policies through Optimal Stopping, in: 2021 17th International Conference on Network and Service Management (CNSM), 2021. doi:10.23919/CNSM52442.2021.9615542.
[32] M. Feng, H. Xu, Deep reinforcement learning based optimal defense for cyber-physical system in presence of unknown cyber-attack, in: 2017 IEEE Symposium Series on Computational Intelligence (SSCI), IEEE, 2017.
[33] A. Tahsini, N. Dunstatter, M. Guirguis, C. M. Ahmed, DeepBLOC: A Framework for Securing CPS through Deep Reinforcement Learning on Stochastic Games, in: 2020 IEEE Conference on Communications and Network Security (CNS), 2020, pp. 1–9. doi:10.1109/CNS48642.2020.9162219.

A. Hyperparameter Values
Optimal, lower and upper bounds of the hyperparameters are shown in Table 4. A uniformly sampled grid search was used to determine the optimal values.

Agent     Parameter                       Value
AK        gamma                           0.99
          network layers                  [256, 256]
          Curiosity                       Yes
          Beta (Curiosity)                0.2
          Eta (Curiosity)                 1
          Feature Dimension (Curiosity)   53
          Learning Rate (Curiosity)       0.001
          Learning Rate                   0.0005
SR        gamma                           -17.710
          network layers                  [256, 256]
          Curiosity                       Yes
          Beta (Curiosity)                0.2
          Eta (Curiosity)                 1
          Feature Dimension (Curiosity)   53
          Learning Rate (Curiosity)       0.001
          Learning Rate                   0.0005
PPO       gamma                           0.99
          network layers                  [256, 256, 52]
          Curiosity                       No
          Learning Rate                   0.0005
Bandits   epsilon                         0.01

Table 4: Hyperparameters.

B. Extended adversary models
Here we provide the full action-outcome transition graphs for the BLineAgent adversary, both with and without the presence of our defensive model. Table 5 provides the definitions of all the acronyms used.

Acronym   Definition
DRS       Discover Remote Systems
DNS       Discover Network Services
ERS       Exploit Remote Service
PE        Privilege Escalate

Table 5: Acronyms used in the action-outcome transition graphs.

Figure 7: Action-outcome transition graph of the BLineAgent adversary without defensive action.
Figure 8: Action-outcome transition graph of the BLineAgent adversary faced with our fully-trained defensive model. Cf. Figure 7, new states and transitions are shown in red.