<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Defending the unknown: Exploring reinforcement learning agents&apos; deployment in realistic, unseen networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alberto</forename><surname>Acuto</surname></persName>
							<email>a.acuto@liverpool.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Electrical Engineering, Electronics and Computer Science</orgName>
								<orgName type="institution">The University of Liverpool</orgName>
								<address>
									<addrLine>Brownlow Hill</addrLine>
									<postCode>L69 3GJ</postCode>
									<settlement>Liverpool</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simon</forename><surname>Maskell</surname></persName>
							<email>s.maskell@liverpool.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="department">School of Electrical Engineering, Electronics and Computer Science</orgName>
								<orgName type="institution">The University of Liverpool</orgName>
								<address>
									<addrLine>Brownlow Hill</addrLine>
									<postCode>L69 3GJ</postCode>
									<settlement>Liverpool</settlement>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">The Alan Turing Institute</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Defending the unknown: Exploring reinforcement learning agents&apos; deployment in realistic, unseen networks</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">448062D7AED5C1983A162E31284A52FF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:58+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Reinforcement Learning</term>
					<term>Cyber security</term>
					<term>Network simulation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The increasing number of network simulators has opened opportunities to explore and apply state-of-the-art algorithms to understand and measure the capabilities of such techniques in numerous sectors. In this regard, the recently released Yawning Titan is one example of a simple, but no less detailed, representation of a cyber network scenario in which it is possible to train agents guided by reinforcement learning algorithms and measure their effectiveness in trying to stop an infection. In this paper, we explore how different reinforcement learning algorithms lead the training of various agents in different example and realistic networks. We assess how we can deploy such agents on a set of networks, focusing in particular on the resilience of the agents when exploring networks with complex starting states, an increased number of routes connecting the nodes and different levels of challenge, aiming to evaluate the deployment performance in realistic networks never seen before.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The development of autonomous resilient agents in the context of automated cyber defence (ACD) to counteract the actions of external or malevolent actors is becoming a pivotal research topic for both academia and governmental agencies. In recent years, cyber crimes have become an increasing presence in the day-to-day life of organisations and governmental institutions, and automated cyber defence is one of the most actively developed research topics <ref type="bibr" target="#b0">[1]</ref>. Novel technologies such as machine learning (ML) and reinforcement learning (RL) are increasingly employed for both defence and attack thanks to their adaptability, cyber resilience and variety of applications. Some defensive examples are ML applications in spam detection <ref type="bibr" target="#b1">[2]</ref> and malware and intrusion detection <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>; offensive applications can relate to the deployment of algorithms to exploit vulnerabilities of infrastructures and limit the visibility and extend the duration (or frequency) of</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Yawning Titan</head><p>Yawning Titan is a graph-based cyber-security simulation environment that allows the training of intelligent agents to counteract the actions of a red enemy agent that aims to spread through the network. The YT setup specifies the red and blue agents' capabilities (e.g. the usable actions and success rates) and the network's description: connections between the nodes, entry nodes and the presence and location of a "high-value target" (HVT). Each agent has a set of parameters describing the probability of success of its actions, together with the game rules (i.e., how the red agent can spread from a compromised node, or whether the blue agent can detect failed attacks in the network). In detail, the red agent has a set of actions, one of which is randomly picked, to "attack" a node, move in the network or "do nothing". The blue agent, instead, has a wider set of possible actions which it can perform:</p><p>• Isolate: removes all edges of the node, cost 10; • Restore: returns the node to its original status (from "compromised" to "safe"), cost 1. The agent can be punished if it patches a safe node or if there are too many infected nodes;</p><p>• Make node safe: reduces the vulnerability<ref type="foot" target="#foot_2">3</ref> of a node, cost 0.5. The agent can be punished in the same way as for the Restore action; • Connect: reinstates all edges of the node, cost 0; the agent is rewarded with 5 points if it reduces the number of isolated nodes; • Add deceptive node: adds an extra "fake" node between two nodes to slow the spread of the red agent <ref type="foot" target="#foot_3">4</ref> , cost 8. The agent is also punished by 5 points if it adds more deceptive nodes than allowed (in our case 3); • Do nothing: self-explanatory, cost -0.5. The agent is punished for this action if there are many infected nodes.</p><p>The blue action space can be modified using a configuration file. 
The score is obtained from a combination of the action costs plus the rewards obtained from removing red nodes, the penalties from the actions and the final points from ending the game (winning or losing, ±100 points). The score is the parameter the agents need to optimise. The network is the "gym" where the two agents interact and can either be loaded from an existing scenario or prompted by the user. There are multiple ways to describe a network: it is fundamental to map the specific connections across the nodes <ref type="foot" target="#foot_4">5</ref> and, with that knowledge, it is possible to generate an adjacency matrix that is read and interpreted by a graph-based Python library (Networkx<ref type="foot" target="#foot_5">6</ref> ) which can interoperate with Pandas. A network is defined by nodes connected with edges, some of which are defined as "entry nodes", the starting points from which the red agent will begin its spread. It is possible to add a "high-value target", a valuable asset inside the network; the red agent can target that specific node and, under certain game rules, its capture can trigger the endgame. In the network, the software generates each node and assigns it a specific vulnerability; entry nodes tend to have a higher vulnerability score because they are infection starting points. At the end of each simulation step, the vulnerabilities are re-evaluated in light of the red and blue actions. At the beginning of each simulation, we have a safe network where every node is clean and the game parameters (entry and HVT nodes) are set.</p></div>
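To make the network description above concrete, here is a minimal sketch of building such a graph with Networkx from an adjacency matrix. The matrix, node labels and game parameters are purely illustrative, not taken from YT's API; a Pandas-based pipeline would typically use `nx.from_pandas_adjacency` instead of the explicit loop.

```python
import networkx as nx

# Illustrative 4-node adjacency matrix (symmetric, no self-loops):
# a 1 at [i][j] means an edge connects node i and node j.
adjacency = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

# Build the graph from the upper triangle of the matrix.
graph = nx.Graph()
n = len(adjacency)
graph.add_nodes_from(range(n))
graph.add_edges_from(
    (i, j) for i in range(n) for j in range(i + 1, n) if adjacency[i][j]
)

# Illustrative game parameters: entry nodes and a high-value target.
entry_nodes = [0]
high_value_target = 3

print(sorted(graph.edges()))  # [(0, 1), (0, 2), (1, 2), (2, 3)]
```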
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Algorithms</head><p>We have considered algorithms from the library Stable-Baselines3, which implements model-free RL algorithms where an "agent" learns to play by interacting with the environment. The agent acquires knowledge of the states by performing actions, obtaining a reward (positive or negative) and observing their effect on the environment. The aim of the agent is to learn from a policy the best actions in order to maximise the total rewards across an episode, which is everything that happens between the first and the last state in the environment (each step being a timestep). We have considered online, model-free RL algorithms <ref type="foot" target="#foot_6">7</ref> with well-documented applications across different simulations <ref type="bibr" target="#b15">[16]</ref> and a discrete action space. The YT environment describes the possible actions on a discrete space, which scales with the number of nodes and the usable actions on each node. Other RL algorithms need box-shaped action spaces or images (e.g. in the case of CNNPOLICIES), which are not straightforward to implement with YT. The policies considered are MLPPOLICIES, where we pass the state vectors of the network as input to our model. These policies implement an actor-critic neural network using a multilayer perceptron (with 2 layers of 64 neurons). The algorithms we are considering are Proximal Policy Optimisation (or PPO <ref type="bibr" target="#b16">[17]</ref>), Advantage Actor Critic (or A2C <ref type="bibr" target="#b17">[18]</ref>) and Deep Q-Network (or DQN <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref>). The PPO algorithm performs policy gradient optimisation based on natural policy gradients. 
This algorithm is known to perform better than similar ones because its training is more stable: broad policy updates are avoided, which helps convergence to an optimal solution and allows enough time to recover from an action <ref type="bibr" target="#b20">[21]</ref>. The algorithm optimises the policy objective function using gradient descent (or ascent) and uses a "clipped" surrogate of the objective function, which prevents too-large policy updates.</p><p>A2C is an Actor-Critic method based on temporal difference learning<ref type="foot" target="#foot_7">8</ref> that represents the policy function independently of the value function. Our implementation, Advantage Actor Critic, derives from the Asynchronous Advantage Actor-Critic (A3C) without the asynchronous part. This algorithm, like PPO, uses the policy gradient to weigh the actions and reduces the variance by using a large number of samples (created by a single agent exploring the action space), with the expectation that these will approach the true estimate.</p><p>DQN is a deep reinforcement learning algorithm which uses Q-learning to learn the best action to take in a given state; a deep neural network (or convolutional neural network) is implemented to estimate the value of the Q-function.</p></div>
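The "clipped" surrogate mentioned above can be illustrated with a minimal pure-Python sketch of the per-sample objective. The numbers are illustrative; Stable-Baselines3's actual PPO operates on batched tensors with learned advantage estimates.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO's per-sample clipped objective: min(r*A, clip(r, 1-eps, 1+eps)*A).

    ratio     -- pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage -- estimated advantage of the taken action
    eps       -- clip range, preventing too-large policy updates
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage, the objective stops growing once the
# ratio exceeds 1 + eps, discouraging overly large policy updates.
print(clipped_surrogate(1.5, advantage=2.0))  # capped at 1.2 * 2.0 = 2.4
print(clipped_surrogate(0.9, advantage=2.0))  # inside the clip range: 1.8
```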
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Algorithms hyper-parameters exploration</head><p>We explore how the different algorithms react to changing some hyper-parameters, such as the discount factor (𝛾) and the learning rate (lr). 𝛾 weighs how much the agent values future rewards relative to immediate ones. An agent with 𝛾 = 0 only cares about its immediate reward (a myopic approach), while with 𝛾 = 1 it is equally interested in all future rewards. lr controls how quickly the model parameters are updated, setting the size of the steps toward the solution: a smaller lr can slow the gradient descent, while a larger value can fail to converge.</p><p>In the DQN algorithm, we have reduced the buffer size to 10000 (from 10⁶), because in the largest networks (&gt;50 nodes) there was the risk of requiring more memory than allocated on the HPC<ref type="foot" target="#foot_8">9</ref> cores. By doing so, we both achieved faster convergence and ensured we did not overload the computing nodes.</p><p>There are studies (e.g., <ref type="bibr" target="#b21">[22]</ref>) that compare various RL agents in different contexts by changing and tuning the various hyper-parameters. For the purposes of this paper, we have not fine-tuned our agents to perform on the networks, because we wanted to retain control over how the performances may differ. We have chosen the best models to deploy in realistic networks according to the testing on sample networks.</p></div>
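The role of 𝛾 described above can be shown with a small sketch of the discounted return; the reward sequence is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t: how much the agent values future rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0]

# gamma = 0: a myopic agent only counts the immediate reward.
print(discounted_return(rewards, gamma=0.0))  # 1.0
# gamma = 1: all future rewards count equally.
print(discounted_return(rewards, gamma=1.0))  # 4.0
# gamma = 0.75 (the value explored in this paper) sits in between.
print(discounted_return(rewards, gamma=0.75))  # 2.734375
```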
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Simulation setup</head><p>We have trained agents on a set of networks comprising a small case of 18 nodes (25 edges), a medium one of 50 nodes (&gt;250 edges) and the largest case of 100 nodes (&gt;500 edges), with an increasing number of entry points (3, 5, 10). An agent trained on an 18-node network cannot be deployed onto a larger or smaller network because the dimension of the action space is bound to the possible states of the specific case.</p><p>The simulation setup has a red agent that can spread only via connected nodes, with a 45% infection success rate and a 15% chance of spreading from connected infected nodes; the endgame rule is that the red agent wins if it takes over 80% of the network, while 500 timesteps are the target for a blue victory. The HVT is chosen randomly, furthest away from the entry points. We train and analyse the performances of the various algorithms on a set of example networks, using the same network for both training and evaluation; then, we test the best algorithms on a series of realistic network configurations after training on similar networks with the same number of nodes.</p></div>
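The spread and endgame rules above can be sketched in a few lines of Python. The function and variable names are ours, not YT's, and the step rule is a simplification: each edge out of an infected node carries a fixed per-step chance of infecting its neighbour.

```python
import random

def red_spread_step(edges, infected, spread_p=0.15, rng=None):
    """One step of red spread: each edge out of an infected node has a
    spread_p chance of infecting the (not yet infected) neighbour."""
    rng = rng or random.Random()
    newly = set()
    for u, v in edges:
        # An edge can carry the infection in either direction.
        for src, dst in ((u, v), (v, u)):
            if src in infected and dst not in infected and rng.random() < spread_p:
                newly.add(dst)
    return infected | newly

def red_wins(infected, n_nodes, takeover=0.8):
    """Endgame rule: the red agent wins at 80% takeover of the network."""
    return len(infected) / n_nodes >= takeover

edges = [(0, 1), (1, 2), (2, 3)]
# With certain spread, only direct neighbours of node 0 are infected in one step.
print(red_spread_step(edges, {0}, spread_p=1.0))  # {0, 1}
```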
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Training the agents</head><p>We perform the training of the various agents on the networks without any hyper-parameter tuning. Then, we train the agents modifying just one parameter at a time, 𝛾 or lr, on the same networks. We set 5 × 10⁵ timesteps and consider the training converged when we have measured no improvement of the average reward for up to four consecutive evaluations (a single evaluation is the average of the rewards over 50 timesteps).</p></div>
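The convergence criterion above (no improvement over four consecutive evaluations) could be checked with a sketch like the following; this is our simplification, not the stopping logic Stable-Baselines3 ships with, and the reward history is invented.

```python
def has_converged(eval_rewards, patience=4):
    """True when the last `patience` evaluations show no improvement
    over the best average reward seen before them."""
    if len(eval_rewards) <= patience:
        return False
    best_before = max(eval_rewards[:-patience])
    return all(r <= best_before for r in eval_rewards[-patience:])

# Illustrative history of average rewards, one entry per evaluation:
# the best value (-195) is not improved upon in the last four entries.
history = [-300, -250, -200, -195, -198, -200, -197, -199]
print(has_converged(history))  # True
```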
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>We present the algorithms and the hyper-parameters modified in the training phase; the first row uses the standard hyper-parameter values. We divide the training over the three network sizes (left, central and right columns), showing the training time (in seconds) and the final score obtained in each case. In table 1, we compare the training of the agents, showing the training time (in seconds) and the final scores. All agents converge to an optimal solution before the end of the training; we also find that adding more nodes (a larger action space) increases the time requirements. The final score can give us an indication of the expected results during the trials; however, we should not be surprised by different results. Comparing the training times, PPO seems to be the quickest to converge in all three network sizes. The A2C algorithm reaches stability quite quickly, even if it is not the best performing at the end of the training session; the PPO agents tend to have steady and stable growth in performance during the training, while the DQN agents have a consistent behaviour: in all trainings, they gain very little during the initial steps, then rapidly improve their performance, overtaking the A2C performance as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Deploy the trained agents</head><p>Having presented the training, in this section we compare the performances of the agents on the same seeded networks and obtain the mean and standard deviation of the scores achieved (using the Stable-Baselines3 evaluator function EVALUATE_POLICY). We compare these performances with the scores obtained by a random agent on the same networks; this agent randomly chooses a node and one action from those available. By testing the agents' performances on the same seeded networks, we aim to evaluate the agents on a constrained set of examples of starting points and red agent actions, reducing the variability of the games and aiming to better understand how they behave.</p><p>In figure <ref type="figure" target="#fig_1">1</ref>, we compare the agents' performances on the same networks. With diamonds we present the standard hyper-parameter setup, crosses the algorithms with 𝛾 = 0.75 and triangles the case with lr = 0.001. We show the PPO results using blue dots, orange for the A2C case and green for the DQN algorithm. We can see that in the case of 18 nodes, all algorithms have similar and comparable performances (around -130 as mean reward). In the 50 nodes case, we measure a more significant difference in performances, in particular in the DQN case, almost 5 times lower than the PPO results. Instead, in the 100 nodes case we measure a considerably lower score in the PPO case (in which the mean reward passes from a few hundred to -2000) while, on the opposite side, DQN shows higher scores. As stated earlier, we changed the buffer size of DQN; by doing so, we measure a significant improvement in the performances, highlighting a strong positive influence of this hyper-parameter. Both the A2C and DQN algorithms, with a reduced discount factor, perform better than PPO in the largest network case. 
By increasing the network size, the agents have more opportunities to take action, and more chances of receiving negative rewards because the red agent is able to spread more. The simulations are therefore longer, and many of them resulted in a blue agent victory since the agent was able to slow the spread by taking more expensive actions such as adding deceptive nodes. In the case of large networks, it is also important to see how spread out the final scores are, measured by the standard deviation, to understand how an algorithm performs overall.</p><p>We want to extend this analysis while changing the starting conditions, in detail: adding isolated nodes, compromised nodes or a mixture of isolated and compromised nodes, changing the number of edges and changing the red agent's skills. In figures <ref type="figure">2</ref> and 3, we summarise the mean reward scores obtained by the various agents while applying the proposed changes to the networks. Isolated and compromised nodes were randomly chosen, therefore it may have happened that some nodes were both isolated and compromised at the same time. Changing the number of edges in the network creates, or removes, routes for the red agent to spread, but also allows the blue agent to defend the network better by adding more deceptive nodes or isolating crossroad nodes, reducing the effectiveness of the spread. 
Changing the red agent's skills increases, or reduces, the simulation challenge level: a red agent with a higher success rate spreads more quickly and is more difficult to react to; on the other hand, a less effective red agent leaves the blue agent more time to fix the nodes.</p><p>We can see that, in most cases, tweaking the network (isolated or compromised nodes) does not result in a measurable change in the final scores. For instance, the PPO algorithm obtains a mean of 130 points in all three variations when we lower 𝛾. Considering the DQN algorithm, we find a similar picture: even if the scores differ from the standard case, the variation trials have comparable results between them. Considering the explorations with fewer or more edges, we measure a noteworthy difference in the mean rewards between PPO and the other algorithms. The PPO agent, even with the variations, shows a lower score when removing edges, while this is not the case for the other two agents; however, the random agent's performances are significantly lower. On the other hand, adding edges increases the variations for the random agent, but the three algorithms behave similarly. A similar conclusion can be drawn for the instances of red agent skills: a less effective red agent (low skill) makes the game longer, resulting in scores built on top of more expensive actions and fewer rewards from fixing the network. A more effective red agent is more aggressive and spreads quickly, and the blue agent gains more points from fixing the nodes, which results in shorter game lengths because it is more difficult to stop and prevent the final escalation. Indeed, this is also verified by the small spread of the results from the random agent.</p></div>
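The evaluation procedure used throughout this section (mean and standard deviation of scores over repeated seeded games, in the spirit of Stable-Baselines3's EVALUATE_POLICY) can be sketched as follows. The episode function is a stand-in built from the action costs of section 1.1, not YT's actual game loop.

```python
import random
import statistics

def evaluate_agent(play_episode, n_eval_episodes=10, seed=0):
    """Mean and standard deviation of episode scores over a fixed seed,
    mirroring the shape of stable_baselines3's evaluate_policy output."""
    rng = random.Random(seed)
    scores = [play_episode(rng) for _ in range(n_eval_episodes)]
    return statistics.mean(scores), statistics.pstdev(scores)

def random_agent_episode(rng, n_steps=50):
    """Stand-in episode: a random agent paying the blue action costs
    (node choice and the game dynamics are omitted for brevity)."""
    costs = {"isolate": 10, "restore": 1, "make_safe": 0.5,
             "connect": 0, "do_nothing": -0.5}
    return -sum(costs[rng.choice(sorted(costs))] for _ in range(n_steps))

mean_score, std_score = evaluate_agent(random_agent_episode)
print(mean_score, std_score)
```

Because the evaluator seeds its own generator, repeated calls give identical statistics, which is what makes comparisons between agents on "the same seeded networks" meaningful.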
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the case of a highly skilled red agent, the performances of the random agent and the trained ones are really similar, even if still distinguishable.</p><p>In the 50 node scenario, bottom panel of figure <ref type="figure">2</ref>, we see a significant difference between the various algorithms: the PPO agent always performs better than the others, with the DQN agent the lowest of the three. Modifying the network results in a similar spread of the scores, even when varying parameters such as 𝛾 and lr.</p><p>In the 100 nodes case, figure <ref type="figure">3</ref>, we see that A2C with lr modified has scores similar to the PPO agents' results in almost all experiments, while DQN obtains significantly different results, in particular when modifying 𝛾. Interestingly, we do not see much difference when we modify the number of edges, with all models scoring similar results. The A2C agents are the worst performing, with results significantly lower than the other agents. Summarising the findings and analysis from these cases, we can say:</p><p>• for a given algorithm (and hyper-parameter choice), adding isolated or compromised nodes does not significantly change the performances in comparison to a clean starting network; • the scores in larger networks are lower because the game length is generally higher; this happens because the red agent is not able to overtake the network quickly, therefore the blue agent has more time for expensive actions, while on smaller networks it is more rewarding to fix the nodes even if that results in a loss;</p><p>• changing the network topology can trigger significant changes in the agents' response.</p><p>Adding more edges results in higher scores. 
However, we measure similar performances between the agents and the hyper-parameter variations; • the red agent's skill has a significant impact on the performances but, more importantly, elicits the same reaction from the blue agent, which we understand by noting the small differences between all the scores (small standard deviations).</p><p>Given this overview of the results, we can say that both PPO and A2C perform well in small networks, in particular when changing the discount factor (𝛾), while adjusting the buffer size and 𝛾 in the DQN case leads to good and stable performance in the larger network cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Agent deployment on realistic networks</head><p>In the previous sections, we explored the performances of RL algorithms trained and tested on the same networks, without any resemblance to reality, trying to identify the best algorithm and check the agents' resilience to changes. In this subsection, we focus on a sample of cases using realistic network configurations to test the agents' behaviour. We consider three cases with 22, 55 and 60 nodes. We have extracted these networks from a portion of a larger existing network of computers, considering only connected nodes, i.e. nodes that are connected between themselves but do not share a connection with the core of the network are not considered. We deploy an agent trained using the A2C algorithm in the first instance, PPO with the standard setup for the 55 node network and DQN with 𝛾 = 0.75 in the last. We train the agents on example networks with the same number of nodes but with a different configuration (number of edges), and we evaluate the changes in performances on the realistic ones, which are effectively novel networks to the agents. In table 2, we summarise the network statistics for the synthetic networks used in training, the realistic environments used in deployment and the algorithms selected. The average clustering measures how interconnected the nodes are: nodes connected by more edges have a value closer to 1. A triangle is a set of three nodes where each node has a relationship with the other two. 
These quantities can describe the complexity of the network; in particular, it is clear that the realistic networks are much simpler (lower average clustering and number of triangles) in comparison to the synthetic ones.</p><p>In figure <ref type="figure" target="#fig_3">4</ref>, we compare the performances of the trained agents in five different scenarios on the network used for training and on the realistic one, showing the random agent's performances for comparison. As before, we test the agents while modifying the network by adding isolated nodes, compromised nodes or a mixture of those, and against a red agent with lower and higher skill levels. In this figure, the y-axis is in log-scale for easier comparison. The goal of this analysis is to understand how an agent trained on a different network performs on a realistic network without re-training.</p><p>In the case of A2C, we can see that the scores on the training network are almost identical to each other and almost two times higher in comparison to the ones obtained in the realistic scenario exploration. It is interesting to note that the scatter across the five realisations is small in the training networks,</p></div>
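The two network statistics used to compare synthetic and realistic networks can be computed directly with Networkx; the toy graph below is illustrative, not one of the paper's networks.

```python
import networkx as nx

# A toy graph: nodes 0-1-2 form a triangle, node 3 hangs off node 2.
g = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])

# Average clustering: mean over nodes of how close each node's
# neighbourhood is to a complete graph (closer to 1 = more clustered).
avg_clustering = nx.average_clustering(g)

# nx.triangles counts, per node, the triangles it belongs to; each
# triangle is counted once per participating node, hence the division by 3.
n_triangles = sum(nx.triangles(g).values()) // 3

print(avg_clustering, n_triangles)  # 0.5833... and 1
```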
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>We present the synthetic networks used for training and the realistic networks used in deployment. The network statistics are the number of nodes, edges, high-value targets (HVT) and entry nodes. The algorithms used are: A2C and PPO with standard hyper-parameters and DQN with 𝛾 = 0.75 and buffer size = 10000. Finally, we present the average clustering value and the number of triangles present in the network: the average clustering measures how closely the nodes are connected with edges (more connected nodes have this measure closer to one), and a triangle is defined as three nodes where each one has a relationship with the other two. While the scatter is small in the training networks, in the realistic ones there is a larger variance of results. This result may be connected to the peculiar shape of the realistic network and the relatively low number of routes the red agent can choose from. The random agent's performances are significantly lower in comparison to the other agents.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the network with 55 nodes we use the PPO algorithm: we measure values close to -4500 (close to the random agent's performances) on the training network for most of the different scenarios, while on the real network the scores are around -1500, even when varying the red agent's skill. We can conclude that the difference in the topology of the network has played a significant role in this analysis, overcoming as well the impact of varying the red agent's skill, which had shown a larger effect in the previous analysis.</p><p>We have deployed the DQN algorithm in the final case considered. We find a behaviour similar to the one seen in the 22 nodes case, with scores on the training network really close to one another (around -300) and higher in comparison to the realistic scenario (close to -1000). On the realistic network, we measure a larger variability in the final scores; again, we measure that the largest impact on the scores is due to the red agent's skills. The random agent's scores are significantly lower in all tests.</p><p>In light of these results, we can state that the YT framework allows the training of RL agents and that their deployment is transferable from synthetic to realistic networks with minimal loss of performance. In particular, we confirm that the agents' scores are lower in comparison to the ones obtained by agents trained and evaluated on the same networks; however, they are still much improved over random scores. This analysis highlights that changing the networks' topology has not invalidated the performances of such agents.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion and future work</head><p>In this paper, we have performed training and evaluation of RL agents on a set of networks using Yawning Titan's ACO capabilities, comparing how their performances change by modifying the status of the network and the methods' hyper-parameters, looking for the best algorithm to deploy in realistic networks. The main findings are that by increasing the number of nodes, the mean reward per simulation is lower, highlighting a positive correlation with the action space dimension, and that varying the red agent's skills has the largest impact on the results. We did not measure significant differences in the scores while modifying the status of the nodes (being compromised or isolated); on the other hand, adding (or removing) edges in the network and augmenting (or reducing) the red agent's skills showed interesting differences in the performances. We find that the RL algorithms considered react well to network changes, as measured by the level of performance across the various tests. Exploring the hyper-parameter tuning, the discount factor (𝛾) seems to have the most positive impact on the training and evaluation processes, in comparison to the limited results obtained by changing the learning rate (lr).</p><p>This work has demonstrated the possibility of using Yawning Titan to train agents that could be considered in realistic cyber-defence environments with minimal computational requirements. 
The tests we have carried out have shown that the changes in the agents' performances arose from the different network topologies and not from changes in the network status itself. We have shown, as well, that it is possible to deploy an agent trained on a different topology with minimal loss of performance, and in some cases (e.g., the 50 nodes networks) we have measured an improvement in the mean scores. These results show the possibility of training intelligent agents on synthetic networks and deploying such agents on realistic networks without re-training. However, little exploration has been done into modifying or exploiting the current rewards of actions inside the simulations, exploring different winning setups (e.g., allowing the endgame if the high-value target is taken), more complex scenarios (e.g., more red agents, complex decision making) and other Markov decision process algorithms.</p><p>Extensions of the current work could explore algorithms with proper hyper-parameter tuning, exploitation of the current reward scheme, offline learning methods and the inclusion of multi-agent algorithms and time-evolving networks.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Agents' performance comparison in the different networks (left panel 18 nodes, centre 50 nodes, right panel 100 nodes). We show the algorithms with standard hyper-parameters using diamonds, with crosses the case modifying 𝛾 and with triangles when modifying the lr. We use different colours to easily identify the algorithms: blue for PPO, orange for A2C and green for DQN. We also show the 1𝜎 deviations of the scores with coloured lines. The grey band is the random agent scores.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :Figure 3 :</head><label>23</label><figDesc>Figure 2: Mean rewards for the agents on the three networks (18 nodes top, 50 nodes bottom and 100 nodes in Figure 3); in blue symbols we show the PPO results, A2C in orange and DQN in green. The three symbols show the different changes to the algorithm: the diamond is the standard version of the algorithm, the cross uses 𝛾 = 0.75 and the triangle uses lr = 0.001. The random agent (RND) is shown in grey, with the mean value as a dashed line and the grey area showing the 1𝜎 deviation. The 1𝜎 deviation on the agents' scores is shown using y-errorbars. On the x-axis we show the various extensions tested, such as adding compromised nodes and adding or removing edges.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Agents' performance comparison between training networks and realistic networks. In the top panel we present the case with 22 nodes using the A2C algorithm, in the central panel the 55-node network using PPO and, finally, in the bottom panel the DQN agent applied to a 60-node network. The green crosses are the random agent scores, the blue diamonds are the training scores and the orange crosses are the realistic cases. Note that, unlike the previous figures, the y-axis is in log-scale for easier comparison.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Autonomous Resilient Cyber Defence, https://www.gov.uk/government/news/autonomous-resilient-cyber-defenceintelligent-agents.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/dstl/YAWNING-TITAN.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">The vulnerability score of a node is a metric used for evaluating the risk of the node being attacked. Exposed nodes, and nodes neighbouring (i.e., connected to) a compromised node, have a higher vulnerability, meaning a higher probability of being infected.</note>
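The footnote's notion of vulnerability can be sketched as a toy heuristic; note that Yawning Titan's actual scoring formula is not reproduced here, and the weights, node names and function below are invented purely for illustration.

```python
# Illustrative heuristic only (not Yawning Titan's actual formula):
# exposed nodes and neighbours of compromised nodes score higher.

def vulnerability(node, adjacency, compromised, exposed, base=0.2):
    """Toy vulnerability score in [0, 1] for a node of an undirected graph."""
    score = base
    if node in exposed:                          # directly reachable from outside
        score += 0.4
    if any(n in compromised for n in adjacency.get(node, ())):
        score += 0.3                             # neighbours a compromised node
    return min(score, 1.0)

adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(vulnerability("b", adjacency, compromised={"c"}, exposed={"a"}))
```

Here node "b" is not exposed but neighbours the compromised node "c", so it scores above the baseline, matching the footnote's intuition.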
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Adding a deceptive node does not count as adding a node in the network.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">True, for version 1.0.1. In more recent versions, e.g. 2.0.1b, the user can also draw the network.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://networkx.org/documentation/stable/index.html.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">In the case of offline methods, one would create a dataset from the simulation and then train the agents on that data, without live interaction between the agent and the environment.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7">Temporal difference learning methods are a class of model-free reinforcement learning algorithms which learn by bootstrapping the current estimate of the value function.</note>
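The bootstrapping idea in the footnote on temporal difference learning can be made concrete with a minimal tabular TD(0) sketch (an illustrative stand-in, not any algorithm used in the paper; state names and constants are invented):

```python
# Minimal tabular TD(0): the value estimate of a state is nudged towards
# the observed reward plus the *discounted current estimate* of the next
# state's value (bootstrapping), rather than a full Monte Carlo return.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the value table V in place."""
    td_target = r + gamma * V[s_next]        # bootstrap on the current estimate
    V[s] += alpha * (td_target - V[s])
    return V

V = {"s0": 0.0, "s1": 0.0}                   # value table, initialised to zero
for _ in range(100):                         # repeatedly observe s0 --r=1--> s1
    td0_update(V, "s0", 1.0, "s1")
print(round(V["s0"], 2))                     # approaches r + gamma * V["s1"] = 1.0
```

Because the update uses the current estimate of the next state rather than waiting for the episode's outcome, learning is model-free and incremental, which is the property the footnote highlights.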
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8">High-performance computing. The training of the agents was performed on HPC CPU cores at the University of Liverpool computing facility.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors thank the reviewers for their useful comments, which improved the quality of the paper, and thank Neil Dhir for useful discussions. This project was financially supported by a contract with the Alan Turing Institute.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A Qualitative Research on the Impact and Challenges of Cybercrimes</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Krishna</forename><surname>Viraja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Purandare</surname></persName>
		</author>
		<idno type="DOI">10.1088/1742-6596/1964/4/042004</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Physics Conference Series</title>
		<imprint>
			<biblScope unit="volume">1964</biblScope>
			<biblScope unit="page">042004</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Survey of review spam detection using machine learning techniques</title>
		<author>
			<persName><forename type="first">M</forename><surname>Crawford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Khoshgoftaar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Prusa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Richter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">A</forename><surname>Najada</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Big Data</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Iot security techniques based on machine learning: How do iot devices use ai to enhance security?</title>
		<author>
			<persName><forename type="first">L</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wu</surname></persName>
		</author>
		<idno type="DOI">10.1109/MSP.2018.2825478</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="41" to="49" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A survey of data mining and machine learning methods for cyber security intrusion detection</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">L</forename><surname>Buczak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Guven</surname></persName>
		</author>
		<idno type="DOI">10.1109/COMST.2015.2494502</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Communications Surveys &amp; Tutorials</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="1153" to="1176" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.00783</idno>
		<title level="m">Reinforcement learning for feedback-enabled cyber resilience</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Vyas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hannay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bolton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">P</forename><surname>Burnap</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2303.04926</idno>
		<idno type="arXiv">arXiv:2303.04926</idno>
		<title level="m">Automated Cyber Defence: A Review</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Research and challenges of reinforcement learning in cyber defense decision-making for intranet security</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3390/a15040134</idno>
		<ptr target="https://www.mdpi.com/1999-4893/15/4/134" />
	</analytic>
	<monogr>
		<title level="j">Algorithms</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Robust artificial intelligence for active cyber defence</title>
		<author>
			<persName><forename type="first">A</forename><surname>Burke</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
			<publisher>Alan Turing Institute</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A review of recent advances in machine learning approaches for cyber defense</title>
		<author>
			<persName><forename type="first">R</forename><surname>Buettner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sauter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Klopfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Breitenbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Baumgartl</surname></persName>
		</author>
		<idno type="DOI">10.1109/BigData52589.2021.9671918</idno>
	</analytic>
	<monogr>
		<title level="m">2021 IEEE International Conference on Big Data (Big Data)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3969" to="3974" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Standen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lucas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bowman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">J</forename><surname>Richer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Marriott</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.09118</idno>
		<title level="m">Cyborg: A gym for the development of autonomous cyber agents</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<idno>arXiv</idno>
		<title level="m">CAGE Challenge</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title/>
	</analytic>
	<monogr>
		<title level="j">TTCP CAGE Challenge</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">C W</forename><surname>Group</surname></persName>
		</author>
		<ptr target="https://github.com/cage-challenge/cage-challenge-3" />
		<title level="m">TTCP CAGE Challenge 3</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Molina-Markham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Miniter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Powell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ridley</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.07583</idno>
		<title level="m">Network environment design for autonomous cyberdefense</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Developing optimal causal cyber-defence agents via cyber security simulation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Andrew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Spillard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Collyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dhir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Machine Learning for Cybersecurity (ML4Cyber)</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Stable-baselines3: Reliable reinforcement learning implementations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Raffin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gleave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kanervisto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ernestus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dormann</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v22/20-1364.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="1" to="8" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Schulman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wolski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Klimov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1707.06347</idno>
		<title level="m">Proximal policy optimization algorithms</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Asynchronous methods for deep reinforcement learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Mnih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Badia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lillicrap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Harley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1928" to="1937" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Mnih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Antonoglou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wierstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1312.5602</idno>
		<title level="m">Playing atari with deep reinforcement learning</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Human-level control through deep reinforcement learning</title>
		<author>
			<persName><forename type="first">V</forename><surname>Mnih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kavukcuoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Silver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Rusu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Veness</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Bellemare</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Graves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedmiller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Fidjeland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Ostrovski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature</title>
		<imprint>
			<biblScope unit="volume">518</biblScope>
			<biblScope unit="page" from="529" to="533" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.10306</idno>
		<title level="m">Neural proximal/trust region policy optimization attains globally optimal policy</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Andrychowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raichuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stańczyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Orsini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Girgin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Marinier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hussenot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Geist</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Pietquin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Michalski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gelly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bachem</surname></persName>
		</author>
		<idno type="DOI">10.48550/arXiv.2006.05990</idno>
		<idno type="arXiv">arXiv:2006.05990</idno>
		<title level="m">What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
