Defending the unknown: Exploring reinforcement learning agents' deployment in realistic, unseen networks

Alberto Acuto¹,*, Simon Maskell¹ and Jack D.²
¹ School of Electrical Engineering, Electronics and Computer Science, The University of Liverpool, Brownlow Hill, Liverpool, L69 3GJ
² The Alan Turing Institute

Abstract

The increasing number of network simulators has opened opportunities to explore and apply state-of-the-art algorithms to understand and measure the capabilities of such techniques in numerous sectors. In this regard, the recently released Yawning Titan is one example of a simple, but no less detailed, representation of a cyber network scenario in which it is possible to train agents guided by reinforcement learning algorithms and measure their effectiveness in trying to stop an infection. In this paper, we explore how different reinforcement learning algorithms guide the training of various agents on different example and realistic networks. We assess how we can deploy such agents on a set of networks, focusing in particular on the resilience of the agents when exploring networks with complex starting states, an increased number of routes connecting the nodes and different levels of challenge, aiming to evaluate deployment performance on realistic, previously unseen networks.

Keywords

Reinforcement Learning, Cyber security, Network simulation

1. Introduction

The development of autonomous resilient agents in the context of automated cyber defence (ACD) to counteract the actions of external or malevolent actors is becoming a pivotal research topic for both academia and governmental agencies. In recent years, cyber crimes have become an increasing presence in the day-to-day life of organisations and governmental institutions, and automated cyber defence is one of the most actively developed research topics [1].
Novel technologies such as machine learning (ML) and reinforcement learning (RL) are increasingly employed for both defence and attack thanks to their adaptability, cyber resilience and variety of applications. Some defensive examples are ML applications in spam detection [2] and malware and intrusion detection [3, 4]; offensive applications can relate to the deployment of algorithms that exploit vulnerabilities of infrastructures and limit the visibility or extend the duration (or frequency) of threats [5]. However, simpler ML models are prone to react slowly to changes in the system and not in real time, while RL algorithms tend to be more flexible to change and have been successfully deployed in the detection of spoofing attacks and DDoS attacks [5, 6]. A relevant summary of the current state of the art in environments where this problem is tackled can be found in several review articles [6, 7, 8, 9] and, particularly, in the work by Wang W. et al. [7], in which the authors consider the role of RL as a new technology for cyber-defence decision making. In order to develop realistic cyber scenarios, a number of autonomous cyber operations (ACO) gyms have been developed [6]; some examples are CybORG [10], TTCP CAGE challenges [11, 12, 13] and FARLAND [14].

CAMLIS'23: Conference on Applied Machine Learning for Information Security, October 19–20, 2023, Arlington, VA
* Corresponding author.
a.acuto@liverpool.ac.uk (A. Acuto); s.maskell@liverpool.ac.uk (S. Maskell); j.d@turing.ac.uk (J. D.)
https://github.com/A-acuto (A. Acuto); http://www.simonmaskell.com/ (S. Maskell)
ORCID: 0000-0003-0753-5131 (A. Acuto); 0000-0003-1917-2913 (S. Maskell)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
A recent addition is a network simulator developed in the context of the UK ARCD program¹: Yawning Titan² ([15], hereafter YT), which offers an environment where we can train and explore the capabilities of RL agents to counteract the actions of an enemy red agent. YT is a complex piece of software that models the intrusion of a "red agent" into a network, whose threats a defending "blue agent" needs to counteract. In this paper, we have considered several RL algorithms based on a discrete action space, following the YT simulations' reward-action mechanism, to train and evaluate the agents' performance in scenarios of different size and complexity, in order to test the resilience of the agents. Then, after having identified the best-performing algorithms, we deployed these agents, trained on synthetic cases, on unseen realistic networks. This paper is structured as follows: in Subsection 1.1 we present the Yawning Titan software, and in Section 2 we briefly describe the algorithms considered in this work. In Section 3 we present the simulation and experiment design: in Subsection 3.2 we train and evaluate the agents on a set of example networks, looking for the best-performing algorithm, and in Subsection 3.3 we explore the deployment of such agents in realistic network environments. Finally, in Section 4 we summarise the findings. The code presented in this paper is available in the following GitHub repository: https://github.com/A-acuto/RLYawningTitan.

1.1. Yawning Titan

Yawning Titan is a graph-based cyber-security simulation environment that allows the training of intelligent agents to counteract the actions of a red enemy agent that aims to spread through the network. The YT setup specifies the red and blue agents' capabilities (e.g. the usable actions and success rates) and the network's description: connections between the nodes, entry nodes and the presence and location of a "high-value target" (HVT).
Each agent has a set of parameters describing the probability of success of its actions and the game rules (i.e., how the red agent can spread from a compromised node, or whether the blue agent can detect failed attacks on the network). In detail, the red agent has a set of actions, one of which is randomly picked: "attack" a node, move in the network or "do nothing". The blue agent, instead, has a wider set of possible actions it can perform:

• Isolate: removes all edges of the node, cost 10;
• Restore: returns the node to its original status (from "compromised" to "safe"), cost 1. The agent can be punished if it patches a safe node or if there are too many infected nodes;
• Make node safe: reduces the vulnerability³ of a node, cost 0.5. The agent can be punished in the same way as for the Restore action;
• Connect: reinstates all edges of the node, cost 0; the agent is rewarded with 5 points if it reduces the number of isolated nodes;
• Add deceptive node: adds an extra "fake" node between two nodes to slow the spread of the red agent⁴, cost 8. The agent is also punished by 5 points if it adds more deceptive nodes than allowed (in our case 3);
• Do nothing: self-explanatory, cost -0.5. The agent is punished for this action if there are many infected nodes.

The blue action space can be modified using a configuration file. The score is obtained from a combination of the action costs plus the rewards obtained from removing red nodes, the penalties from the actions and the final points from ending the game (winning or losing, ±100 points). The score is the parameter the agents need to optimise. The network is the "gym" where the two agents interact and can either be loaded from an existing scenario or specified by the user.

¹ Autonomous Resilient Cyber Defence, https://www.gov.uk/government/news/autonomous-resilient-cyber-defence-intelligent-agents.
² https://github.com/dstl/YAWNING-TITAN.
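As an illustration of how the action costs, rewards and penalties above combine into a per-step score, the following is a minimal sketch in plain Python. This is not the Yawning Titan API: the dictionary values mirror the list above, but the function, its name and its flags are hypothetical, and the sign convention (score contribution = reward minus cost) is our reading of the text.

```python
# Illustrative sketch of the blue agent's per-action score contribution.
# Costs and penalties follow the list in the text; the negative "cost"
# of "do nothing" is reproduced verbatim from the paper.
ACTION_COSTS = {
    "isolate": 10.0,
    "restore": 1.0,
    "make_node_safe": 0.5,
    "connect": 0.0,
    "add_deceptive_node": 8.0,
    "do_nothing": -0.5,
}

def step_score(action: str, reduced_isolated: bool = False,
               over_deception_limit: bool = False) -> float:
    """Score contribution of one blue action (rewards minus costs)."""
    score = -ACTION_COSTS[action]
    if action == "connect" and reduced_isolated:
        score += 5.0   # reward for reducing the number of isolated nodes
    if action == "add_deceptive_node" and over_deception_limit:
        score -= 5.0   # penalty for exceeding the deceptive-node limit (3)
    return score
```

The episode score would then be the sum of these contributions plus the rewards for removing red nodes and the ±100 endgame points.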
There are multiple ways to describe a network: it is fundamental to map the specific connections across the nodes⁵ and, with that knowledge, it is possible to generate an adjacency matrix that is read and interpreted by a graph-based Python library (NetworkX⁶), which interoperates with Pandas. A network is defined by nodes connected with edges, some of which are defined as "entry nodes": the starting points from which the red agent will begin its spread. It is possible to add a "high-value target", a valuable asset inside the network; the red agent can target that specific node and, under certain game rules, this can be the trigger for the endgame. The software generates each node in the network and assigns it a specific vulnerability; entry nodes tend to have a higher vulnerability score because they are infection starting points. At the end of each simulation step, the vulnerabilities are re-evaluated in terms of the red and blue actions. At the beginning of each simulation, we have a safe network where every node is clean and the game parameters (entry and HVT nodes) are set.

2. Algorithms

We have considered algorithms from the library Stable-Baselines3, which implements model-free RL algorithms where an "agent" learns to play by interacting with the environment. A trained agent acquires knowledge of the states by performing actions, obtaining a reward (positive or negative) and observing their effect on the environment. The aim of the agent is to learn the best actions from a policy in order to maximise the total reward across an episode, which is everything that happens between the first and the last state in the environment (each step being a timestep). We have considered online, model-free RL algorithms⁷ with well-documented applications across different simulations [16] and that have a discrete action space. The YT environment describes the possible actions on a discrete space, which scales with the number of nodes and the usable actions on each node. Other RL algorithms need box-shaped action spaces or images (e.g. in the case of CnnPolicies), which are not straightforward to implement with YT. The policies considered are MlpPolicies, where we pass the state vectors of the network as the model input. These policies implement an actor-critic neural network using a multilayer perceptron (with 2 layers of 64 neurons). The algorithms we are considering are Proximal Policy Optimisation (PPO [17]), Advantage Actor Critic (A2C [18]) and Deep Q-Network (DQN [19, 20]). The PPO algorithm uses policy gradient optimisation based on natural policy gradients. This algorithm is known to perform better than similar ones because the training is more stable: it avoids broad policy updates, helping convergence to an optimal solution and allowing enough time to recover from an action [21]. The algorithm optimises the policy objective function using gradient descent (or ascent) and uses a "clipped" surrogate of the objective function which prevents too-large policy updates. A2C is an actor-critic method based on temporal difference learning⁸ that represents the policy function independently of the value function.

³ The vulnerability score of a node is a metric used to evaluate the risk of the node being attacked. Exposed nodes and nodes neighbouring (connected to) a compromised node have a higher vulnerability, meaning a higher probability of being infected.
⁴ Adding a deceptive node does not count as adding a node to the network.
⁵ True for version 1.0.1. In more recent versions, e.g. 2.0.1b, the user can also draw the network.
⁶ https://networkx.org/documentation/stable/index.html.
⁷ In the case of offline methods we would have to create a dataset from the simulation and then train the agents on such data, without live interactions between the agent and the environment.
Our implementation is Advantage Actor Critic, derived from Asynchronous Advantage Actor-Critic (A3C) without the asynchronous part. This algorithm, like PPO, uses the policy gradient to weigh the actions and reduces the variance by using a large number of samples (created by a single agent exploring the action space), hoping that one of these will provide the true estimate. DQN is a deep reinforcement learning algorithm which uses Q-learning to learn the best action to take in a given state; a deep neural network (or convolutional neural network) is implemented to estimate the value of the Q-function.

⁸ Temporal difference learning methods are a class of model-free reinforcement learning algorithms which learn by bootstrapping the current estimate of the value function.

2.1. Algorithms hyper-parameters exploration

We explore how the different algorithms react to changing some hyper-parameters, such as the discount factor (γ) and the learning rate (lr). γ determines how much the agent values future rewards relative to immediate ones: an agent with γ = 0 only cares about its immediate reward (a myopic approach), while with γ = 1 it weighs all future rewards fully. lr controls how quickly the learned estimates (e.g. the Q-values in DQN) are updated at each step toward the solution: a smaller lr can slow the gradient descent, while a larger value can fail to converge. In the DQN algorithm, we reduced the buffer size to 10000 (from 10⁶), because in the largest networks (>50 nodes) there was a risk of requiring more memory than allocated on the HPC⁹ cores. By doing so, we both achieved faster convergence and avoided overloading the compute nodes. There are studies (e.g., [22]) that demonstrate and compare various RL agents in different contexts by changing and tuning the various hyper-parameters. For the purposes of this paper, we have not fine-tuned our agents to the networks because we wanted to have control
over the way the performances may differ. We have chosen the best models to deploy in realistic networks according to the testing on sample networks.

⁹ High-performance computing. The training of the agents was performed on HPC CPU cores at the University of Liverpool computing facility.

3. Simulation setup

We have trained agents on a set of networks comprising a small case of 18 nodes (25 edges), a medium one of 50 nodes (>250 edges) and the largest case of 100 nodes (>500 edges), with an increasing number of entry points (3, 5, 10). An agent trained on an 18-node network cannot be deployed onto a larger or smaller network because the dimension of the action space is bound to the possible states of that specific case. The simulation setup has a red agent that can spread only via connected nodes, with a 45% infection success rate and a 15% chance of spreading from connected infected nodes; the endgame rule is that the red agent wins if it takes over 80% of the network, and surviving 500 timesteps is the target for a blue victory. The HVT is chosen randomly among the nodes furthest away from the entry points. We train and analyse the performance of the various algorithms on a set of example networks, using the same network for both training and evaluation; then, we test the best algorithms on a series of realistic network configurations after training on similar networks with the same number of nodes.

3.1. Training the agents

We perform the training of the various agents on the networks without any hyper-parameter tuning. Then, we retrain the agents modifying just one parameter at a time, γ and lr, on the same networks. We set up 5 × 10⁵ timesteps and we consider training converged when we have not measured any improvement of the average reward for up to four consecutive evaluations (a single evaluation is the average of rewards over 50 timesteps).

Table 1: Algorithms and the hyper-parameters modified in the training phase; the first row of each group uses the standard hyper-parameter values. We divide the training over the three network sizes, showing the training time (in seconds) and the final score obtained in each case.

                                   18 Nodes         50 Nodes         100 Nodes
Algorithm                        Time [s]  Score  Time [s]  Score  Time [s]  Score
PPO  γ = 0.99, lr = 0.0003         1722    -130     3899    -108     7264    -120
     γ = 0.75                      1702    -122     4375    -107     5972    -102
     lr = 0.001                    1714    -117     4337    -120     6655    -100
A2C  γ = 0.99, lr = 0.0007         2221     -99     3149    -331    13660    -558
     γ = 0.75                      2235    -109     3467    -353     4540   -2220
     lr = 0.001                    2221    -110     3463    -352    13700    -547
DQN  γ = 0.99, lr = 0.0005         1655    -114     6800    -235    12389    -409
     γ = 0.75                      1617    -121     5714    -283    14755    -400
     lr = 0.001                    1641    -119     5120    -308    13852    -476

In Table 1, we compare the training of the agents, showing the training time (in seconds) and the final scores. All agents converge to an optimal solution before the end of the training; we also find that adding more nodes (a larger action space) increases the time requirements. The final score can give us an indication of the expected results during the trials; however, we should not be surprised by different results. Comparing the training times, PPO seems to be the quickest to converge on all three network sizes. The A2C algorithm reaches stability quite quickly, even if it is not the best performing at the end of the training session; the PPO agents tend to show steady and stable growth in performance during training, while the DQN agents have a consistent behaviour: in all trainings they gain very little during the initial steps, then rapidly improve their performance, overtaking the A2C performance as well.

3.2. Deploy the trained agents

In this section, after having presented the training, we compare the performances of the agents on the same seeded networks and obtain the mean and standard deviation of the scores achieved (using the Stable-Baselines3 evaluator function evaluate_policy).
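The evaluation protocol (mean and standard deviation of episode scores over repeated seeded games, as Stable-Baselines3's evaluate_policy reports) can be sketched in plain Python. The function, its arguments and the toy episode below are illustrative stand-ins, not Yawning Titan or Stable-Baselines3 code.

```python
import random
import statistics

def evaluate(policy, run_episode, n_episodes=50, seed=0):
    """Mean and standard deviation of episode scores, in the spirit of
    Stable-Baselines3's evaluate_policy. Seeding the episode generator
    means every agent faces the same sequence of games, which is what
    makes the comparisons in this section fair."""
    rng = random.Random(seed)
    scores = [run_episode(policy, rng) for _ in range(n_episodes)]
    return statistics.mean(scores), statistics.pstdev(scores)

# Toy stand-ins: a "random agent" whose episode score is noise around -500.
random_agent = None  # a purely random policy needs no parameters here
toy_episode = lambda policy, rng: -500.0 + rng.uniform(-50.0, 50.0)

mean, std = evaluate(random_agent, toy_episode, n_episodes=50, seed=1)
```

Re-running with the same seed reproduces the same mean and spread, so differences between agents reflect the policies rather than the draw of games.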
We compare these performances with the scores obtained by a random agent on the same networks; this agent randomly chooses a node and one action from those available. By testing the agents' performances on the same seeded networks, we aim to evaluate the agents on a constrained set of starting points and red agent actions, reducing the variability of the games and better understanding how they behave. In figure 1, we compare the agents' performances on the same networks. Diamonds show the standard hyper-parameter setup, crosses the algorithms with γ = 0.75 and triangles the case with lr = 0.001. We show the PPO results in blue, A2C in orange and DQN in green. We can see that in the 18-node case all algorithms have similar and comparable performances (around -130 as mean reward). In the 50-node case, we measure a more significant difference in performances, in particular for DQN, whose scores are almost 5 times lower than the PPO results. Instead, in the 100-node case we measure a considerably lower score for PPO (whose mean reward drops from a few hundred to around -2000) while, on the opposite side, DQN shows higher scores. As stated earlier, we changed the buffer size of DQN, and by doing so we measure a significant improvement in performance, highlighting the strong positive influence of this hyper-parameter. Both the A2C and DQN algorithms, with a reduced discount factor, perform better than PPO in the largest network case. By increasing the network size, the agents have more opportunities to take action, and more chances of receiving negative rewards because the red agent is able to spread more. Therefore, the simulations are longer, and many of them resulted in a blue agent victory, since the agent was able to slow the spread by taking more expensive actions such as adding deceptive nodes.
In cases of large networks, it is also important to see how spread out the final scores are, measured by the standard deviation, to understand how an algorithm performs overall. We want to extend this analysis while changing the starting conditions, in detail: adding isolated nodes, compromised nodes, a mixture of isolated and compromised nodes, changing the number of edges and the red agent's skill. In figures 2 and 3, we summarise the mean reward scores obtained by the various agents when applying the proposed changes to the networks.

Figure 1: Agents' performance comparison in the different networks (left panel 18 nodes, centre 50 nodes and 100 nodes right panel). We show the algorithms with standard hyper-parameters using diamonds, crosses for the case modifying γ and triangles when modifying the lr. We use different colors to easily identify the algorithms: blue for PPO, orange for A2C and green for DQN. We show as well the 1σ deviations of the scores with colored lines. The grey band is the random agent scores.

Isolated and compromised nodes were randomly chosen; therefore, it may have happened that some nodes were both isolated and compromised at the same time. Changing the number of edges in the network creates, or removes, routes for the red agent to spread, but also allows the blue agent to defend the network better by adding more deceptive nodes or isolating crossroad nodes, reducing the effectiveness of the spread. Changing the red agent's skill increases, or reduces, the simulation challenge level: a red agent with a higher success rate spreads more quickly and is more difficult to react to; on the other hand, a less effective red agent leaves more time for the blue agent to fix the nodes.
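The starting-state variations just listed amount to simple graph surgery. A minimal sketch with NetworkX follows; the function name, its defaults and the independence of the two draws are our assumptions, not Yawning Titan's implementation.

```python
import random
import networkx as nx

def perturb(graph, n_isolated=0, n_compromised=0, seed=0):
    """Randomly isolate nodes (drop all their edges) and mark nodes as
    compromised. The two draws are independent, so a node may end up
    both isolated and compromised at once, as noted in the text."""
    rng = random.Random(seed)
    g = graph.copy()
    for n in rng.sample(list(g.nodes), n_isolated):
        g.remove_edges_from(list(g.edges(n)))  # the node stays, its routes vanish
    compromised = set(rng.sample(list(g.nodes), n_compromised))
    return g, compromised

# Example on a toy 18-node graph (a complete graph, not the paper's network).
g, compromised = perturb(nx.complete_graph(18), n_isolated=2,
                         n_compromised=3, seed=42)
```

Fixing the seed makes each perturbed starting state reproducible across the agents being compared.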
We can see that, in most cases, tweaking the network (isolating or compromising nodes) does not result in a measurable change in the final scores. For instance, the PPO algorithm obtains a mean of about -130 points in all three variations when we lower γ. Considering the DQN algorithm, we find a similar picture: even if the scores differ from the standard case, the variation trials have comparable results among themselves. Considering the explorations with fewer or more edges, we measure a noteworthy difference in mean rewards between PPO and the other algorithms. The PPO agent, even with the variations, shows a lower score when removing edges, while this is not the case for the other two agents; however, the random agent's performance is significantly lower. On the other hand, adding edges increases the variance for the random agent, but the three algorithms behave similarly.

Figure 2: Mean rewards for the agents on the three networks (18 nodes top, 50 nodes bottom and 100 nodes in figure 3); in blue symbols we show the PPO results, A2C in orange and DQN in green. The three symbols show the different changes to the algorithm: the diamond is the standard version, the cross uses γ = 0.75 and the triangle lr = 0.001. The random agent (RND) is shown in grey with the mean value as a dashed line and the grey area showing the 1σ deviation. The 1σ deviation on the agents' scores is shown using y-errorbars. On the x-axis we show the various extensions tested, such as adding compromised or isolated nodes, adding or removing edges, and lowering or raising the red agent's skill.

Figure 3: Same figure as 2, but for the 100 nodes case.

A similar conclusion can be drawn for the red agent's skill: a less effective red agent (low skill) makes the game longer, resulting in scores built on more expensive actions and fewer rewards from fixing the network. A more effective red agent is more aggressive and spreads quickly, and the blue agent gains more points from fixing nodes, which results in shorter games because it is more difficult to stop and prevent the final escalation. Indeed, this is also verified by the small spread of the results from the random agent. In the case of a highly skilled red agent, the performances of the random agent and the trained ones are really similar, even if still distinguishable. In the 50-node scenario (bottom panel of figure 2), we see a significant difference between the various algorithms: the PPO agent always performs better than the others, with the DQN agent the lowest of the three. Modifying the network results in a similar spread of scores, even when varying parameters such as γ and lr. In the 100-node case (figure 3), we see that A2C with modified lr has scores similar to the PPO agents' results in almost all experiments, while DQN obtains significantly different results, in particular when modifying γ. Interestingly, we do not see much difference when we modify the number of edges, with all models scoring similar results. The remaining A2C configurations are the worst performing, with results significantly lower than the other agents.
Summarising the findings and analysis in these cases, we can say:

• for a given algorithm (and hyper-parameter choice), adding isolated or compromised nodes does not significantly change the performance in comparison to a clean starting network;
• the scores in larger networks are lower because the games are generally longer; this happens because the red agent is not able to overtake the network quickly, so the blue agent has more time to take expensive actions, while on smaller networks it is more rewarding to fix the nodes even if that results in a loss;
• changing the network topology can trigger significant changes in the agents' response: adding more edges results in higher scores, although we measure similar performances across the agents and hyper-parameter variations;
• the red agent's skill has a significant impact on performance but, more importantly, affects all blue agents in the same way, as we see from the small differences between all scores (small standard deviations).

Given this overall picture, we can say that both PPO and A2C perform well in small networks, in particular when changing the discount factor (γ), while adjusting the buffer size and γ in the DQN case leads to good, stable performance in the larger network cases.

3.3. Agent deployment on realistic networks

In the previous sections, we explored the performances of RL algorithms trained and tested on the same networks, without any resemblance to reality, trying to identify the best algorithm and check the agents' resilience to changes. In this subsection, we focus on a sample of cases using realistic network configurations to test the agents' behaviour. We consider three cases with 22, 55 and 60 nodes. We have extracted these networks from a portion of a larger existing network of computers, keeping only nodes connected to the core of the network: nodes that are connected among themselves but share no connection with the core are not considered. We deploy, in the first instance, an agent trained using the A2C algorithm, then PPO with the standard setup for the 55-node network and DQN with γ = 0.75 in the last case. We train the agents on example networks with the same number of nodes but a different configuration (number of edges), and we evaluate the change in performance on the realistic ones, which are effectively novel networks to the agents. In table 2, we summarise the network statistics for the synthetic networks used in training, the realistic environments used in deployment and the algorithms selected. The average clustering measures how connected the nodes are: nodes connected by more edges have a value closer to 1. A triangle is a set of three nodes where each node has a relationship with the other two. These quantities describe the complexity of the network; in particular, it is clear that the realistic networks are much simpler (lower average clustering and number of triangles) than the synthetic ones. In figure 4, we compare the performances of the trained agents in five different scenarios on the network used for training and on the realistic one, and we show the random agent's performance for comparison. As before, we test the agents while modifying the network by adding isolated nodes, compromised nodes, a mixture of those, and against a red agent with lower and higher skill levels. In this figure the y-axis is in log-scale for easier comparison. The goal of this analysis is to understand how an agent trained on a different network performs on a realistic network without re-training. In the case of A2C, we can see that the scores in the training network are almost identical and almost two times higher in comparison to the ones obtained in the realistic scenario exploration.
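The two complexity statistics reported for the networks (average clustering and triangle count) come straight out of NetworkX. The toy graphs below are illustrative, not the paper's networks: a complete graph stands in for a dense synthetic network and a star graph for a tree-like deployment network (22 nodes, 21 edges, as in the 22-node deployment case).

```python
import networkx as nx

def triangle_count(g):
    # nx.triangles returns a per-node count; every triangle is seen by 3 nodes
    return sum(nx.triangles(g).values()) // 3

dense = nx.complete_graph(6)   # dense, synthetic-style graph
tree = nx.star_graph(21)       # 22 nodes, 21 edges: hub and leaves

print(nx.average_clustering(dense), triangle_count(dense))  # 1.0 20
print(nx.average_clustering(tree), triangle_count(tree))    # 0.0 0
```

Any tree has zero triangles and zero clustering, which is why the deployment networks score 0 on both statistics while the dense synthetic ones score much higher.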
It is interesting to note that the scatter across the five realisations is small in the training networks, while in the realistic ones there is a larger variance of results.

Table 2: Synthetic networks used for training and realistic networks used in deployment. The network statistics are the number of nodes, edges, high-value targets (HVT) and entry nodes. The algorithms used are A2C and PPO with standard hyper-parameters and DQN with γ = 0.75 and buffer size 10000. Finally, we present the average clustering value and the number of triangles in the network: the average clustering measures how closely the nodes are connected with edges (more connected nodes push this measure closer to one), and a triangle is defined as three nodes where each one has a relationship with the other two.

Mode    #Nodes  #Edges  #HVT  Entry nodes  Algorithm  Avg. clustering  #Triangles
Train     22      113     1       21          A2C          0.45            540
Deploy    22       21     1       21          A2C          0                 0
Train     55      730     5       10          PPO          0.49           9500
Deploy    55       54     5       10          PPO          0                 0
Train     60      901     4       12          DQN          0.5           13400
Deploy    60       62     4       12          DQN          0.045             6

This result may be connected to the peculiar shape of the realistic network and the relatively low number of routes the red agent can choose from. The random agent's performance is significantly lower in comparison to the other agents. In the network with 55 nodes we use the PPO algorithm: we measure values close to -4500 (close to the random agent's performance) in the training network for most of the different scenarios, while in the real network the scores are around -1500, even when varying the red agent's skill. We can conclude that the difference in network topology has played a significant role in this analysis, overcoming even the impact of varying the red agent's skill, which showed a larger effect in the previous analysis. We have deployed the DQN algorithm in the final case considered.
We find a behaviour similar to the one seen in the 22-node case, with scores in the training network really close to one another (around -300) and higher than in the realistic scenario (close to -1000). In the realistic network we measure a larger variability in the final scores; again, the largest impact on the scores is due to the red agent's skill. The random agent's scores are significantly lower in all tests. In light of these results, we can state that the YT framework allows the training of RL agents whose deployment transfers from synthetic to realistic networks with minimal loss of performance. In particular, the agents' scores are lower in comparison to the ones obtained by agents trained and evaluated on the same networks; however, they are still much improved over random scores. This analysis highlights that changing the networks' topology has not invalidated the performance of such agents.

Figure 4: Agents' performance comparison between training networks and realistic networks. In the top panel we present the case with 22 nodes using the A2C algorithm, in the central panel the 55-node network using PPO and, finally, in the bottom panel the DQN agent applied to a 60-node network. The green crosses are the random agent scores, blue diamonds are the training scores and the realistic cases are in orange crosses. Please note that the y-axis, differently from the previous figures, is in log-scale for easier comparison.

4. Conclusion and future work

In this paper, we have performed training and evaluation of RL agents in a set of networks using Yawning Titan's ACO capabilities, comparing how their performances change by modifying the
status of the network and the methods' hyper-parameters, looking for the best algorithm to deploy in realistic networks. The main findings are that by increasing the number of nodes the mean reward per simulation is lower, highlighting a correlation with the action-space dimension, and that varying the red agent's skill has the largest impact on the results. We did not measure significant differences in the scores while modifying the status of the nodes (being compromised or isolated); on the other hand, adding (or removing) edges in the network and augmenting (or reducing) the red agent's skill showed interesting differences in performance. We find that the RL algorithms considered can react well to network changes, as measured by the level of performance across the various tests. Exploring the hyper-parameter tuning, the discount factor (𝛾) seems to have the most positive impact on the training and evaluation processes, in comparison to the limited results obtained by changing the learning rate (lr). This work has demonstrated the possibility of using Yawning Titan to train agents that could be considered in realistic cyber-defence environments with minimal computational requirements. The tests we have carried out have shown that the changes in the agents' performance arose from the different network topologies and not from changes in the network status itself. We have shown, as well, that it is possible to deploy an agent trained on a different topology with minimal loss of performance, and in some cases (e.g., the 55-node networks) we have measured an improvement in the mean scores. These results show the possibility of training intelligent agents in synthetic networks and deploying such agents in realistic networks without re-training.
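The outsized effect of the discount factor compared with the learning rate is easy to see numerically: 𝛾 controls how far into an episode a per-step penalty still influences the return the agent optimises. A small sketch (the constant reward stream is invented for illustration; the paper's DQN runs used 𝛾 = 0.75):

```python
def discounted_return(rewards, gamma):
    """Discounted return G = sum_t gamma^t * r_t, the quantity whose
    horizon the discount factor gamma controls in DQN training."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# The same 10-step stream of -1 penalties valued under two discount factors:
stream = [-1.0] * 10
low = discounted_return(stream, 0.75)   # short horizon, as used here for DQN
high = discounted_return(stream, 0.99)  # long horizon weighs late penalties more
```

With 𝛾 = 0.75 the return is roughly -3.77 versus roughly -9.56 at 𝛾 = 0.99, so the choice of 𝛾 reshapes the optimisation target itself, whereas the learning rate only changes how fast a fixed target is approached.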
However, little exploration has been done in modifying or exploiting the current rewards of actions inside the simulations, in exploring different winning setups (e.g., allowing the game to end if the high-value target is taken), more complex scenarios (e.g., more red agents, complex decision making) and other Markov decision process algorithms. Extensions of the current work include exploring algorithms with proper hyper-parameter tuning, exploitation of the current reward scheme, offline learning methods and the inclusion of multi-agent algorithms and time-evolving networks.

Acknowledgments

The authors thank the reviewers for their useful comments, which improved the quality of the paper, and thank Neil Dhir for useful discussions. This project was financially supported by a contract with the Alan Turing Institute.

References

[1] V. Krishna Viraja, P. Purandare, A qualitative research on the impact and challenges of cybercrimes, in: Journal of Physics Conference Series, volume 1964, 2021, p. 042004. doi:10.1088/1742-6596/1964/4/042004.
[2] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, H. A. Najada, Survey of review spam detection using machine learning techniques, Journal of Big Data 2 (2015).
[3] L. Xiao, X. Wan, X. Lu, Y. Zhang, D. Wu, IoT security techniques based on machine learning: How do IoT devices use AI to enhance security?, IEEE Signal Processing Magazine 35 (2018) 41–49. doi:10.1109/MSP.2018.2825478.
[4] A. L. Buczak, E. Guven, A survey of data mining and machine learning methods for cyber security intrusion detection, IEEE Communications Surveys & Tutorials 18 (2016) 1153–1176. doi:10.1109/COMST.2015.2494502.
[5] Y. Huang, L. Huang, Q. Zhu, Reinforcement learning for feedback-enabled cyber resilience, 2021. arXiv:2107.00783.
[6] S. Vyas, J. Hannay, A. Bolton, P. P. Burnap, Automated cyber defence: A review, arXiv e-prints (2023) arXiv:2303.04926. doi:10.48550/arXiv.2303.04926.
[7] W.
Wang, D. Sun, F. Jiang, X. Chen, C. Zhu, Research and challenges of reinforcement learning in cyber defense decision-making for intranet security, Algorithms 15 (2022). URL: https://www.mdpi.com/1999-4893/15/4/134. doi:10.3390/a15040134.
[8] A. Burke, Robust artificial intelligence for active cyber defence, Alan Turing Institute (2017).
[9] R. Buettner, D. Sauter, J. Klopfer, J. Breitenbach, H. Baumgartl, A review of recent advances in machine learning approaches for cyber defense, in: 2021 IEEE International Conference on Big Data (Big Data), 2021, pp. 3969–3974. doi:10.1109/BigData52589.2021.9671918.
[10] M. Standen, M. Lucas, D. Bowman, T. J. Richer, J. Kim, D. Marriott, CybORG: A gym for the development of autonomous cyber agents, arXiv preprint arXiv:2108.09118 (2021).
[11] CAGE Challenge 1, arXiv, 2021.
[12] TTCP CAGE Challenge 2, 2022.
[13] T. C. W. Group, TTCP CAGE Challenge 3, https://github.com/cage-challenge/cage-challenge-3, 2022.
[14] A. Molina-Markham, C. Miniter, B. Powell, A. Ridley, Network environment design for autonomous cyberdefense, 2021. arXiv:2103.07583.
[15] A. Andrew, S. Spillard, J. Collyer, N. Dhir, Developing optimal causal cyber-defence agents via cyber security simulation, in: Workshop on Machine Learning for Cybersecurity (ML4Cyber), 2022.
[16] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-Baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (2021) 1–8. URL: http://jmlr.org/papers/v22/20-1364.html.
[17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017).
[18] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International Conference on Machine Learning, PMLR, 2016, pp. 1928–1937.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M.
Riedmiller, Playing Atari with deep reinforcement learning, arXiv preprint arXiv:1312.5602 (2013).
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533.
[21] B. Liu, Q. Cai, Z. Yang, Z. Wang, Neural proximal/trust region policy optimization attains globally optimal policy, arXiv preprint arXiv:1906.10306 (2019).
[22] M. Andrychowicz, A. Raichuk, P. Stańczyk, M. Orsini, S. Girgin, R. Marinier, L. Hussenot, M. Geist, O. Pietquin, M. Michalski, S. Gelly, O. Bachem, What matters in on-policy reinforcement learning? A large-scale empirical study, arXiv e-prints (2020) arXiv:2006.05990. doi:10.48550/arXiv.2006.05990.