<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing Exfiltration Path Analysis Using Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riddam Rishu</string-name>
          <email>rrishu@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshay Kakkar</string-name>
          <email>akshkakkar@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Wang</string-name>
          <email>chengwang@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdul Rahman</string-name>
          <email>abdulrahman@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Redino</string-name>
          <email>credino@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dhruv Nandakumar</string-name>
          <email>dnandakumar@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tyler Cody</string-name>
          <email>tcody@vt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan Clark</string-name>
          <email>ryanclark4@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Radke</string-name>
          <email>dradke@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Bowen</string-name>
          <email>edbowen@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAMLIS'23: Conference on Applied Machine Learning for Information Security</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Security Institute</institution>
          ,
          <addr-line>Virginia Tech</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Building on previous work using reinforcement learning (RL) focused on identification of exfiltration paths, this work expands the methodology to include protocol and payload considerations. The former approach to exfiltration path discovery, where reward and state are associated specifically with the determination of optimal paths, is presented with these additional realistic characteristics to account for nuances in adversarial behavior. The paths generated are enhanced by including communication payload and protocol in the Markov decision process (MDP) in order to more realistically emulate attributes of network-based exfiltration events. The proposed method will help emulate complex adversarial considerations such as the size of a payload being exported over time or the protocol on which it occurs, as is the case where threat actors steal data over long periods of time using system-native ports or protocols to avoid detection. As such, practitioners will be able to improve identification of expected adversary behavior under various payload and protocol assumptions more comprehensively.</p>
      </abstract>
      <kwd-group>
        <kwd>reinforcement learning</kwd>
        <kwd>exfiltration</kwd>
        <kwd>penetration testing</kwd>
        <kwd>cyber terrain</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Adversaries commonly exfiltrate traffic through the domain name system (DNS) to deter detection while obfuscating intent [8, 7, 6, 9, 5].</p>
      <p>The previous literature’s drawbacks are discussed in [1], and the topic of using RL for
conducting post-exploitation activities such as exfiltration is still under-studied. Previous work
[4], for example, employs ontological models of the agent with actions defined using common
software modules. While this may be useful in some capacity, it suffers from a lack of alignment to network
structure, path structure, and cyber terrain, thereby limiting its ability to anchor agents to the
real computer network. In addition, the output is not operationally interpretable in a way that security
operations center (SOC) and cyber analysts can act on [6, 9, 7, 5].</p>
      <p>This paper presents a framework for using RL methods for discovering exfiltration paths in
network models while accounting for attacker preferences in payload and protocol.
The key contributions of this paper are twofold:
1. An approach for modeling data exfiltration on networks that accounts for choices in
protocol with varying size of payload.
2. The implementation of RL-based algorithms for discovering exfiltration paths in network
models.</p>
      <p>The presented methodology is aligned with a focus on network structure and configuration,
path analysis, and cyber terrain. Its outcomes can be directly understood as paths through
networks, as is highlighted in a detailed discussion of the results. To support reproducibility,
the RL solution methods, experimental design, and network model are specified in great detail.</p>
      <p>The remainder of this work begins with a background on the use of RL for penetration testing
followed by an exploration of the methods for modeling defensive terrain and discovering
exfiltration paths. Then, the experimental design is described for evaluating the proposed
approach, followed by an analysis of the experimental results, and a discussion of the findings.
Lastly, this paper concludes with remarks on modeling decisions, a summary of the work, and
possible avenues of future research.</p>
    </sec>
    <sec id="sec-3">
      <title>2. RL and Penetration Testing</title>
      <sec id="sec-3-1">
        <title>2.1. Reinforcement Learning</title>
        <p>RL is a framework where an agent learns to optimize its behaviour by interacting with its
environment [10]. A Markov decision process (MDP) (S, A, P, R, γ) is often used to model the
environment, where S is the state space, A is the action space, P : S × A → Δ(S) is the transition
function, R : S × A × S → ℝ is the reward function, and γ ∈ (0, 1] is the discount factor,
which determines the present value of future rewards. The agent’s behavior is characterized by
its policy π, which is a probability distribution over actions given a state. For deterministic
policies, the action taken in state s can be denoted as π(s). At each time step t, the
agent observes a state s_t, takes an action a_t according to π(·|s_t), transitions to
a new state s_{t+1}, and receives a reward r_t = R(s_t, a_t, s_{t+1}). The cumulative discounted reward is
called the return and is defined as G_t = ∑_{k=0}^{∞} γ^k r_{t+k}. The RL agent aims to learn an optimal policy
π*, which maximizes the expected return from each state. RL algorithms can be categorized
into three groups: value function-based (or critic-only) methods, policy gradient (or actor-only)
methods, and actor-critic methods.</p>
        <p>Value function-based methods such as Q-learning [11] or deep Q-networks (DQN) [12] learn
optimal policies by first estimating the optimal action-value function Q*(s, a):
Q*(s, a) ≡ max_π Q^π(s, a) ≡ max_π E_π[G_t | s_t = s, a_t = a], (1)
which can be obtained by solving the Bellman equation:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]. (2)
Then, an optimal policy π* is derived by selecting the action that yields the largest Q-value:
π*(s) = argmax_a Q*(s, a). (3)</p>
        <p>On the other hand, policy gradient approaches focus on directly parameterizing the policy
π(a|s; θ) and optimizing a performance measure J(θ), such as the expected return, via
gradient ascent. Such methods often suffer from high variance and may result in slow learning.
Thus, to reduce the variance, actor-critic methods use an estimate of the value function
V^π(s) ≡ E_π[G_t | s_t = s] as a baseline when estimating the policy gradient ∇_θ J(θ) [13]. The critic is
responsible for learning the value function while the actor updates policy parameters by using
the estimated policy gradient. In particular, the policy gradient can be estimated as
∇_θ J(θ) ≈ E[∇_θ log π(a_t | s_t; θ) Â_t], (4)
where Â_t = Q(s_t, a_t) − V(s_t) represents the advantage of taking action a_t at state s_t.</p>
        <p>Policy gradient methods are prone to performance collapse as a result of large policy updates,
which can be challenging to recover from because the agent will have been trained on the
experience produced by bad policies. To improve training stability, Proximal Policy Optimization
(PPO) [14] uses a clipped surrogate objective function:
ℒ(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)], (5)
where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the probability ratio of the new policy over the old policy.</p>
        <p>The advantage function Â_t is often estimated using generalized advantage estimation [15],
truncated after T steps:
Â_t = δ_t + (γλ)δ_{t+1} + ⋯ + (γλ)^{T−t+1} δ_{T−1}, (6)
where δ_t = r_t + γV(s_{t+1}) − V(s_t). (7)</p>
        <p>To support exploration, an entropy bonus cS[π](s) is often added to the objective function (5),
where c is a coefficient.</p>
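As a minimal illustrative sketch (not the authors' implementation), the clipped surrogate objective and the truncated generalized advantage estimates described above can be computed numerically as follows; all function names are ours:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), averaged over a batch
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))

def gae(rewards, values, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` carries one
    # extra bootstrap entry V(s_T) beyond the last reward
    deltas = [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

The clipping keeps the update close to the old policy: a ratio of 2.0 with a positive advantage contributes only as if the ratio were 1 + ε.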
      </sec>
      <sec id="sec-3-2">
        <title>2.2. RL applications in penetration testing</title>
        <p>Deep RL has been applied to cybersecurity broadly [13], but only recently has it been employed
as a tool for penetration testing [16, 17, 18, 19, 20, 21, 22, 23]. There are a number of different
approaches, but most only consider privilege escalation on a target host as the learning task.
Gangupantulu et al. proposed using concepts of cyber terrain to help enrich task design and
reward shaping [23]. This concept spurred the development of several task-specific uses of RL
for penetration testing, including crown jewel analysis [24], discovering exfiltration paths [1],
and exposing surveillance detection routes [25].</p>
        <p>As with Gangupantulu et al. [24], the RL approach presented here solves a more complex
task and acts as a focused tool for cyber operators to increase the effectiveness of operator
workflow in penetration testing. RL for penetration testing has made frequent use of DQN
[17, 21, 22, 23, 24]. As an alternative, Nguyen et al. proposed an RL-based approach that makes
use of two agents: one for iteratively scanning the network to build a structural model and
another for exploiting the constructed model [26]. In this study, Nguyen et al.’s double-agent
architecture is combined with the PPO algorithm to train the RL agents.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methods</title>
      <p>In this section, we present the details of the exfiltration model, the protocol-based path selection
criteria, and the complete RL formulation. While the model incorporates several assumptions,
it is fundamentally based on a data-driven approach using scan data. This reliance on scan data
not only ensures empirical robustness but also permits iterative refinement, as newer or more
comprehensive data become available, to progressively approach a more accurate representation
of reality.</p>
      <sec id="sec-4-1">
        <title>3.1. Exfiltration Simulation Overview</title>
        <p>The approach proposed here expands on previous models for data exfiltration in that it can
model paths for different payload sizes. The exfiltration campaign is modeled based on three
tasks: (i) connection, (ii) path selection, and (iii) exfiltration. The attacker initially
attempts to gain control of some of the known target hosts which are externally connected via
an internet connection to serve as the point of exfiltration. Once control of the target host is
gained, an exfiltration path is selected based on the preferences for an exfiltration protocol. The
attacker then tries to exfiltrate data packets from the compromised host. The three tasks are
designed to function so that if an attacker discovers a new exfiltration host, the path selection
module determines whether a better path exists and adjusts the exfiltration path accordingly.</p>
        <p>The agent explores the network and gathers information on neighboring hosts by taking the
subnet scan action. In order for the scan to be successful, the agent must first gain access to the
underlying host, which can be achieved by executing an exploit action. Multiple exploits may
exist for a given machine, with each targeting a specific Common Vulnerabilities and Exposures
(CVE) vulnerability.</p>
        <p>Once a foothold is gained on a new host, the agent updates the candidate exfiltration paths that
consist of each of the compromised hosts. It then decides which path is preferred to carry out
the exfiltration based on the predetermined exfiltration protocol strategy. If another target is
captured later, the path selection task evaluates the new paths available and, if a new preferred
path is discovered, the path is updated and the payload is reset to its original value.</p>
        <p>After identifying the preferred exfiltration path, the agent can start sending parts of the
payload to the exit node. The task is completed if the entire payload is uploaded from the initial
node. In order to evade firewall detection, the agent should avoid frequent and large uploads.
To hide its activity, the agent may take a sleep action that simply does nothing for a period of
time.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Network Firewalls</title>
        <p>As in [27], any exfiltration traffic will be monitored by network firewalls, which are placed
between each of the subnets and the public Internet. Upon detection of unusual traffic patterns,
the administrator will be alerted and an emergency firewall update will be conducted. Examples
of suspicious activities include the following:
• the total egress volume exceeds max_upload_volume;
• the total active time surpasses max_upload_time.</p>
        <p>Firewalls are also updated periodically. In particular, a wall-clock is introduced to simulate
the real time of an attack campaign. Different actions will increase the clock time by different
amounts depending on their complexity. Both the regular update and the emergency update
will patch the vulnerabilities and block the outbound traffic from the compromised hosts.</p>
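The two detection triggers above amount to a simple predicate over per-host traffic statistics. A sketch follows; the threshold defaults are illustrative values of ours, not taken from the paper:

```python
def firewall_alert(egress_volume_mb, active_time_s,
                   max_upload_volume=500.0, max_upload_time=60.0):
    # alert if either the total egress volume or the total active upload
    # time from a compromised host exceeds its threshold (defaults assumed)
    return (egress_volume_mb > max_upload_volume
            or active_time_s > max_upload_time)
```

An agent that uploads slowly and sleeps between uploads keeps both statistics below threshold and avoids triggering an emergency update.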
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Protocol-Based Path Selection</title>
        <p>Exfiltration activities within attacker campaigns are typically carried out by exploiting a common
protocol, as these are deemed generally safer and less likely to be detected by security monitoring.
Standard protocols, such as Hypertext Transfer Protocol Secure (HTTPS), are often used to
carry out data exfiltration. Because common protocols are used by enterprise applications, it is
more likely that these protocols are available. It is also more likely that these protocols are not monitored
as closely by security detection methods. As an example, by using the same protocol used by
databases to back up their data to cloud services, attackers emulate the database backup expected
by security rules and do not raise alerts in monitoring systems.</p>
        <p>Path selection is determined by maximizing the utilization of this protocol across as many
hosts in an exfiltration network path as possible. This is not always the shortest path. A
path maximizing the use of the chosen protocol is often more advantageous, even when this
path touches more nodes in the victim’s network. The path selection algorithm accounts
for contingencies when end-to-end use of the designated exfiltration protocol is unavailable.
The algorithm prioritizes finding a complete path using the given protocol over the shortest
path possible. The next criteria considered are the length of the path and the rewards
accumulated. If multiple paths are identified with the same exfiltration protocol coverage, the
shortest path will be prioritized. When no complete path can be created using the exfiltration
protocol, the algorithm searches for the shortest path with the maximum use of the
protocol. The reward function identifies the highest-rewarded path using existing reward
mechanisms, the smallest number of hosts, and maximum use of the exfiltration protocol.</p>
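The ordering described above can be sketched as a lexicographic sort key, under an assumed path representation of ours: a list of (host label, set of services, accumulated reward) tuples per hop. This is an illustration of the criteria, not the paper's algorithm:

```python
def path_preference_key(path, protocol):
    # ordering described above: complete end-to-end protocol coverage first,
    # then higher coverage, then fewer hops, then higher accumulated reward
    hops = len(path)
    covered = sum(1 for _, services, _ in path if protocol in services)
    coverage = covered / hops
    total_reward = sum(reward for _, _, reward in path)
    return (coverage < 1.0, -coverage, hops, -total_reward)

def select_path(candidate_paths, protocol):
    # the preferred exfiltration path minimizes the lexicographic key
    return min(candidate_paths, key=lambda p: path_preference_key(p, protocol))
```

Under this ordering a longer path with full protocol coverage beats a shorter path with partial coverage, matching the behavior reported in the results.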
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Reinforcement Learning Formulation</title>
        <sec id="sec-4-4-1">
          <title>3.4.1. State Space</title>
          <p>The state has the following features for every host:
• Address,
• Operating system,
• Services and processes,
• Discovery value and status,
• Infection value and status,
• Access level information.</p>
          <p>A host’s address is denoted by its subnet ID and local ID. The operating system, service, and
process features have a value of one if they are present at the host and zero otherwise.
Similarly, the discovery and infection statuses are one if the host is discovered or compromised
and zero otherwise. The discovery and infection values represent the reward for successfully
discovering and compromising a host, respectively. Additional features are defined for target
hosts:
• Connection status,
• Time since infection,
• Remaining payload size.
The connection status can be connected, not connected, or isolated (i.e., blocked by firewalls).
The time since infection is measured by the wall-clock rather than time steps. Finally, the
remaining payload size indicates how much is left to upload. The exfiltration task is complete
when the remaining payload size becomes zero.</p>
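One way to picture the per-host features above is as a flat vector with one-hot indicators; the field names and layout here are our assumptions for illustration, not the paper's exact encoding:

```python
def encode_host(host, os_names, service_names):
    # address, one-hot OS and service indicators, discovery/infection
    # status and value, and access level (field names are illustrative)
    vec = [host["subnet_id"], host["local_id"]]
    vec += [1 if host["os"] == name else 0 for name in os_names]
    vec += [1 if s in host["services"] else 0 for s in service_names]
    vec += [host["discovered"], host["discovery_value"],
            host["infected"], host["infection_value"],
            host["access_level"]]
    return vec
```

Target hosts would carry the three additional features (connection status, time since infection, remaining payload size) appended to this vector.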
        </sec>
        <sec id="sec-4-4-2">
          <title>3.4.2. Action Space</title>
          <p>There are four types of actions for the RL agent: subnet scan, exploit, upload, and sleep. Each
action requires specification of a target host, except for the sleep action, which simply does
nothing for a given period of time. Multiple exploits targeting different vulnerabilities may
be available for a given host. Two uploading actions with different speeds are available at each
target: one with a rate of 100 MB/s and another with a rate of 1 MB/s.</p>
          <p>Clock time increases differently based on the action’s result and complexity. Table 2 lists the
assigned clock time for each action. For inapplicable actions, such as performing a subnet
scan without access to the underlying host, the clock time will only move forward by one
second.</p>
        </sec>
        <sec id="sec-4-4-3">
          <title>3.4.3. Reward Function</title>
          <p>The reward function consists of a positive value for achieving sub-goals, such as discovering
or exploiting a host, and a negative value that accounts for the action’s cost. An action with a
higher cost is more likely to trigger the defense terrain. Specifically, we follow the approach
in [1] and assign an action’s cost based on the services running on the target system. The idea is
that even though the adversaries may not know the exact defense mechanism or strength, they
can still infer the presence of defenses based on the host’s service information. In particular, we
categorize the services into three groups: high-risk, medium-risk, and low-risk. The actual
cost of an action then depends on its type (scan, exploit, or upload) and the target’s service
profile.</p>
          <p>Rewards are given based on how much of the chosen exfiltration path is covered by the
exfiltration protocol. For example, if 3 of the 6 hosts in the exfiltration path have the
exfiltration protocol running, then 50 percent of the configured reward is given to the
agent. The agent receives a positive reward for uploading a partial payload from the infected host;
upon finishing sending the entire payload, the agent is given a large bonus reward. However, if
exfiltration is detected by network firewalls, then the agent will receive a penalty equal to the
total accumulated rewards gained on the originating host and the host will be isolated. That
is, the agent will lose all rewards from discovery, infection, and partial uploads. Table 3 lists the
rewards used in this study.</p>
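The coverage-scaled portion of this reward (3 of 6 protocol hosts yielding 50 percent of the configured value) can be sketched directly; the function name is ours:

```python
def protocol_coverage_reward(path_services, protocol, configured_reward):
    # scale the configured reward by the fraction of hosts on the chosen
    # path that run the exfiltration protocol, as described in the text
    covered = sum(1 for services in path_services if protocol in services)
    return configured_reward * covered / len(path_services)
```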
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>In this section we present the experiment details and the results, and discuss key characteristics
of the attack paths learned by the RL agent.</p>
      <sec id="sec-5-1">
        <title>4.1. Network Description</title>
        <p>We have designed two experiment networks. The first experiment network has 10 subnets and
a total of 56 hosts. Each subnet contains between 3 and 12 hosts. The attacker agent is assumed
to have gained an initial foothold on host (8, 2) in subnet 8, which is not directly connected to
the Internet. One particular machine (2, 0) from subnet 2 is designated as the exfiltration host.
Subnet 2 is directly accessible from the Internet, whereas the other subnets are private and not
directly accessible from the Internet. The exfiltration host has a Dynamic Host Configuration
Protocol server (DHCPS) running as a service, which is chosen as the exfiltration protocol.
The second experiment network has 101 subnets and a total of 1444 hosts. This network is
considerably larger than the first. Each subnet contains between 3 and 50 hosts.
The attacker agent is assumed to have gained an initial foothold on host (44, 5) in subnet 44,
which is not directly connected to the Internet. A host connected to the Internet, (5, 10) in
subnet 5, is designated as the exfiltration host. The exfiltration host has an HTTPS service running,
which is chosen as the exfiltration protocol.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Training Details</title>
        <p>The RL agent is trained in an episodic fashion for both networks using the well-known
PPO algorithm. An episode ends when the initial host either completes sending the payload to the
exfiltration host or is isolated by firewalls. The target payload is set to 10,000 MB. Both the
actor and the critic are approximated by a two-layer feed-forward neural network, where the
first layer has 64 neurons and the second layer has 32 neurons. Other key hyperparameters
are listed in Table 4. For the first network the RL agent is trained for 800 episodes, and for the
second network the RL agent is trained for 1000 episodes.</p>
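The actor/critic network shape stated above (two hidden layers of 64 and 32 neurons) can be sketched as follows; the initialization scheme and tanh activation are our assumptions, as the paper does not specify them:

```python
import numpy as np

def build_mlp(input_dim, output_dim, seed=0):
    # two hidden layers of 64 and 32 neurons, as in the text;
    # the random initialization here is purely illustrative
    rng = np.random.default_rng(seed)
    sizes = [input_dim, 64, 32, output_dim]
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # tanh hidden activations (an assumption), linear output layer
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x
```

In an actor-critic setup, one such network with an output per action serves as the actor (followed by a softmax) and another with a single output serves as the critic.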
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>For the first network, episode rewards over training runs are presented in Fig. 1a and episode
lengths in Fig. 1b; for the second network, episode rewards over training runs are presented
in Fig. 2a and episode lengths in Fig. 2b. Training is observed to be stable for both networks,
and the RL policy converges in 800 episodes for the first network and in 1000 episodes for the
second network. Fig. 1a shows that the sum of rewards in an episode for the first network
steadily increases to almost 12,000, and Fig. 2a shows that the sum of rewards in an episode for the
second network steadily increases to a little more than 10,000. During the same intervals, the
episode length gradually decreases for both simulations. This suggests that as training goes on,
the RL agent completes the attack task more efficiently and takes fewer random actions.</p>
      <p>Due to the stochastic nature of the learned policy, the RL agent may take some unnecessary
or redundant actions, such as exploiting unimportant hosts or performing redundant subnet scans. After pruning the
output trajectory, the key steps in the attack for the simulation of the first network can be identified,
as shown in Table 7.</p>
      <p>For the first network, the agent gains a foothold on host (8, 2) in subnet 8, from which it
triggers a subnet scan that leads to the discovery of other hosts in the same subnet and in the
connected subnets, subnet 4 and subnet 6. The agent then exploits host (4, 2) in subnet 4,
which is chosen for further exploitation to build an exfiltration path. A subnet scan is
triggered from host (4, 2), which discovers the hosts present in the connected subnet, subnet 2,
and ultimately the target or exfiltration host (2, 0), which is then compromised to
forge an exfiltration path: (8, 2) → (4, 2) → (2, 0). In search of better paths, the
agent exploits host (6, 0) in subnet 6 and triggers a subnet scan from that host, discovering
hosts on the connected subnet, subnet 5. This scan discovers host (5, 1) in subnet 5, which is later
exploited to forge another exfiltration path: (8, 2) → (6, 0) → (5, 1) → (2, 0).</p>
      <p>The path explored earlier, (8, 2) → (4, 2) → (2, 0), is not a complete exfiltration
protocol-based path, since there is no DHCP (exfiltration protocol) service running on host (4, 2), as
shown in Fig. 3. However, the second path discovered, (8, 2) → (6, 0) → (5, 1) → (2, 0), is a
complete exfiltration protocol-based path, since the same service (i.e., DHCP) is running on both
hosts (6, 0) and (5, 1), as shown in Fig. 3. Notably, the agent chooses the second path over the
first path to upload the payload, as it is a 100 percent protocol-based path and is the optimal path,
even though the first path discovered is shorter in length.</p>
      <p>For the second network, the agent has a foothold on host (44, 5) in subnet 44. Upon
performing various subnet scans and exploits, the agent gains a hold over host (24, 18), and
ultimately discovers and exploits the target or exfiltration host (5, 10). This leads to the
exfiltration path (44, 5) → (24, 18) → (5, 10). Host (24, 18) has an HTTPS service running
on it, hence the path forged is a complete protocol-based path. The capability of the agent to
forge a 100 percent protocol-based path over such a large network indicates that the model is
scalable as well.</p>
      <p>For both networks the agent takes appropriate sleep actions in between the upload actions
so that there is no unusual traffic pattern and cyber defenses are not triggered.</p>
      <p>The agent found paths in both networks that utilize a single network protocol. In real-life
scenarios, attackers try to use a single protocol to avoid increasing attack complexity and to
reduce the risk of inconsistencies or errors, which can lead to a greater possibility of detection.
Choosing to exfiltrate data using existing network protocols that the network defenses (firewalls,
IDS) know about also reduces the risk of discovery by traffic anomaly detection algorithms.
Using standard protocols for exfiltration while considering traffic timing and volume replicates
previously documented Tactics, Techniques, and Procedures (TTPs) [28].</p>
      <p>While novel exfiltration methods that use non-standard protocols, such as Domain Name Service
(DNS), Network Time Protocol (NTP), or Internet Control Message Protocol (ICMP), exist, they
typically require complex setup for execution [8]. They are also usually more closely monitored
by defensive measures for volume and anomalous behaviors than standard protocols due to
their usage in previous exfiltration operations [8]. Data exfiltration requires more network
volume and can be more stealthily sent over less strictly monitored or eccentric channels [29].</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>The current gap within the cybersecurity industry involves contextualizing and quantitatively
prioritizing the efficacy of deployed security controls to enable sense-making for security
practitioners and network defenders. In this paper, we address this gap by applying RL
to exfiltration path analysis enhanced by integrating protocol and payload considerations. Our
work demonstrates that an RL agent can effectively find an exfiltration path with maximum
exfiltration protocol coverage and can perform exfiltration using this preferred path without
being detected by security infrastructure (i.e., firewalls). Our results identify optimal paths
that provide insights for operators, analysts, and defenders to evaluate the value of currently
deployed security controls which influence (i.e., isolate or eliminate) the connections within
the path. As a result, the operations community can utilize this data to formulate task lists for
securing enterprise networks.</p>
      <p>This RL approach identified the most likely hosts and services used when exfiltrating data
while capturing variable metrics used in network risk assessments. The strength of this approach
was validated through identification of intentional network misconfigurations that mimic
real-world vulnerabilities. In future work we consider expanding the risk formalism to increase its
sophistication and maturity, which will drive increased applicability and relevance.</p>
      <p>[22] Z. Hu, R. Beuran, Y. Tan, Automated penetration testing using deep reinforcement learning,
in: 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&amp;PW),
IEEE, 2020, pp. 2–10.
[23] R. Gangupantulu, T. Cody, P. Park, A. Rahman, L. Eisenbeiser, D. Radke, R. Clark,
Using cyber terrain in reinforcement learning for penetration testing, arXiv preprint
arXiv:2108.07124 (2021).
[24] R. Gangupantulu, T. Cody, A. Rahman, C. Redino, R. Clark, P. Park, Crown jewels analysis
using reinforcement learning with attack graphs, arXiv preprint arXiv:2108.09358 (2021).
[25] L. Huang, T. Cody, C. Redino, A. Rahman, A. Kakkar, D. Kushwaha, C. Wang, R. Clark,
D. Radke, P. Beling, et al., Exposing surveillance detection routes via reinforcement
learning, attack graphs, and cyber terrain, arXiv preprint arXiv:2211.03027 (2022).
[26] H. V. Nguyen, S. Teerakanok, A. Inomata, T. Uehara, The proposal of double agent
architecture using actor-critic algorithm for penetration testing, in: ICISSP, 2021, pp.
440–449.
[27] C. Wang, A. Kakkar, C. Redino, A. Rahman, S. Ajinsyam, R. Clark, D. Radke, T. Cody,
L. Huang, E. Bowen, Discovering command and control channels using reinforcement
learning, in: SoutheastCon 2023, IEEE, 2023, pp. 685–692.
[28] MITRE ATT&amp;CK® framework, 2021. URL: https://attack.mitre.org.
[29] B. Sabir, F. Ullah, M. A. Babar, R. Gaire, Machine learning for detecting data exfiltration: A
review 54 (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Redino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kakkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kushwaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Beling</surname>
          </string-name>
          , E. Bowen,
          <article-title>Discovering exfiltration paths using reinforcement learning with attack graphs</article-title>
          ,
          <source>in: 2022 IEEE Conference on Dependable and Secure Computing (DSC)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>N. I. of Standards</surname>
          </string-name>
          , Technology,
          <source>Security and Privacy Controls for Federal Information Systems and Organizations</source>
          ,
          <source>Technical Report NIST Special Publication 800-53</source>
          Revision 5, U.S. Department of Commerce, Washington, D.C.,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Raymond</surname>
          </string-name>
          ,
          <article-title>On cyber: towards an operational art for cyber conflict</article-title>
          , Kopidion Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mimura</surname>
          </string-name>
          ,
          <article-title>Automating post-exploitation with deep reinforcement learning</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>100</volume>
          (
          <year>2021</year>
          )
          <fpage>102108</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Gharakheili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sivaraman</surname>
          </string-name>
          ,
          <article-title>Real-time detection of dns exfiltration and tunneling from enterprise networks, in: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM)</article-title>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>649</fpage>
          -
          <lpage>653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aminov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shabtai</surname>
          </string-name>
          ,
          <article-title>Detection of malicious and low throughput data exfiltration over the dns protocol</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>80</volume>
          (
          <year>2019</year>
          )
          <fpage>36</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Detecting dns over https based data exfiltration</article-title>
          ,
          <source>Computer Networks</source>
          <volume>209</volume>
          (
          <year>2022</year>
          )
          <fpage>108919</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A dns tunneling detection method based on deep learning models to prevent data exfiltration, in: Network and System Security: 13th International Conference</article-title>
          , NSS 2019, Sapporo, Japan,
          <source>December 15-18</source>
          ,
          <year>2019</year>
          , Proceedings 13, Springer,
          <year>2019</year>
          , pp.
          <fpage>520</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>A. Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-Y. Shen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Shashanka</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Detection of exfiltration and tunneling over dns</article-title>
          ,
          <source>in: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>C. J. C. H. Watkins</surname>
          </string-name>
          ,
          <article-title>Learning from delayed rewards (</article-title>
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          , et al.,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for cyber security</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>05799</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>arXiv preprint arXiv:1707.06347</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>High-dimensional continuous control using generalized advantage estimation</article-title>
          ,
          <source>arXiv preprint arXiv:1506.02438</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>M. C. Ghanem</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for intelligent penetration testing</article-title>
          ,
          <source>in: 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kurniawati</surname>
          </string-name>
          ,
          <article-title>Autonomous penetration testing using reinforcement learning</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>05965</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>M. C. Ghanem</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for eficient network penetration testing</article-title>
          ,
          <source>Information</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <article-title>6</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. O'Brien</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Automated post-breach penetration testing through reinforcement learning</article-title>
          ,
          <source>in: 2020 IEEE Conference on Communications and Network Security (CNS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yousefi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mtetwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tianfield</surname>
          </string-name>
          ,
          <article-title>A reinforcement learning approach for attack graph analysis</article-title>
          ,
          <source>in: 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science</source>
          And Engineering (TrustCom/BigDataSE), IEEE,
          <year>2018</year>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Mahendran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Romo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabur</surname>
          </string-name>
          ,
          <article-title>Autonomous security analysis and penetration testing</article-title>
          ,
          <source>in: 2020 16th International Conference on Mobility, Sensing and Networking (MSN)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>508</fpage>
          -
          <lpage>515</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>