                         A Path Discovery Method of Penetration Testing Based on SAC
                         Reinforcement Learning
                         Yimeng Liu1,*,† , Xiaojian Liu2,† and Xuejun Yu3,†
                         1 Beijing University of Technology, Beijing, China
                         2 Beijing University of Technology, Beijing, China
                         3 Beijing University of Technology, Beijing, China


                                         Abstract
                                         As the complexity and scale of systems continue to increase, enterprises are placing ever higher demands on system security, making
                                         comprehensive security analysis particularly important. Among the various methods, penetration testing is regarded as one of the
                                         most direct means of assessing security. To explore the resistance-to-attack characteristic of product quality defined in the SQuaRE
                                         series of standards, this paper proposes a penetration path discovery method based on an improved deep reinforcement learning
                                         algorithm, Adaptive Soft Actor-Critic (ASAC). The method models the penetration process, quantifies the penetration benefits, and
                                         leverages the Soft Actor-Critic (SAC) algorithm, which is suitable for complex state spaces, to construct an intelligent penetration
                                         agent. The goal is to solve for the optimal penetration path, thereby enabling the evaluation of a system's resistance to attacks
                                         during system validation. Experimental results demonstrate that the attack effectiveness of the proposed ASAC algorithm surpasses
                                         that of commonly used reinforcement learning algorithms such as Q-learning and DQN. It can quickly identify the most critical
                                         penetration paths in a network and maintains high performance across different network environments. This approach provides
                                         effective theoretical and technical support for a comprehensive assessment of a system's cybersecurity robustness.

                                         Keywords
                                         Deep Reinforcement Learning, Penetration test, Attack path discovery



                         1. Introduction                                                                                            penetration testing information. Furthermore, reinforce-
                                                                                                                                    ment learning is employed as the agent’s learning strategy,
                         As business scales grow, many enterprises choose to store                                                  enabling it to interact with the environment to maximize
                         data across multiple physical devices (such as servers) to                                                 its reward. Penetration testing can also be viewed as a
                         enhance access speeds, with unified management and access                                                  dynamic decision-making process based on the current en-
                         via networks. Attackers may infiltrate the data storage lo-                                                vironmental state. Given the similarity between penetration
                         cal area network through boundary nodes, thereby stealing                                                  testing and reinforcement learning mechanisms, reinforce-
                         sensitive information and causing significant losses to the                                                ment learning is well-suited to describe penetration testing
                         enterprise. It is crucial for enterprises to periodically con-                                             in unknown environments.
                         duct penetration testing based on the state of their network                                                  This paper proposes an improved reinforcement learning
                         environment, to assess the system’s risk and robustness.                                                   algorithm, ASAC, for penetration path planning in unknown
                         Penetration testing is a black-box security testing method,                                                environments. By improving search strategies and action
                         in which the tester adopts the mindset and technical means                                                 selection policies, the algorithm enhances the planning effi-
                         of an attacker to detect vulnerabilities in business system                                                ciency.
                         targets. It helps enterprises uncover hidden security flaws                                                   Additionally, the proposed method aligns with the
                         and vulnerabilities in normal business processes, breach the                                               ISO/IEC 25000 series standards, specifically the SQuaRE
                         system’s security defenses, and deeply assess the potential                                                (Software Quality Requirements and Evaluation) framework,
                         impacts of vulnerabilities.                                                                                particularly the security resistance sub-characteristics. Ac-
                            In real-world enterprise internal networks and sensitive                                                cording to the SQuaRE standard, security, as an important
                         core networks, initial stages of penetration testing may not                                               aspect of software quality, requires that the system be able
                         provide valuable information due to the lack of direct in-                                                 to resist attacks and ensure the confidentiality, integrity
                         sight into the system. A Markov process                                                                 testing and reinforcement learning mechanisms, reinforce-
                         is a type of stochastic process characterized by the property                                              method, by optimizing the attack path discovery, can evalu-
                         that future states depend only on the current state, not on                                                ate the system’s resistance and defense capability according
                         past states. This paper formalizes the penetration testing                                                 to the calculation time of the model and the success rate
                         process as a Markov Decision Process (MDP), utilizing net-                                                 of multiple simulated attacks. In the ISO/IEC 25010 stan-
                         work information to build a reward function that guides the                                                dard corresponding security features such as confidentiality,
                         agent to discover hidden attack paths from the attacker’s                                                  integrity, availability and resistance have been effectively
                         perspective. This method does not require prior validation                                                 improved.
                         of network structures or software configurations and can                                                      The main contributions of this paper are as follows:
                         autonomously discover attack paths, extracting essential
                                                                                                                                            • We combine complex multi-step attack characteris-
                          6th International Workshop on Experience with SQuaRE family and its                                               tics to conduct penetration test modeling, quantify
                          Future Direction, 3rd December 2024, Chongqing, China (Hybrid)
                                                                                                                                              action gains and network environment vulnerability
                                                                                                                                              losses, and more accurately simulate the process of
                         *
                           Corresponding author.                                                                                           attackers gradually breaking through various fire-
                         †
                           These authors contributed equally.                                                                              walls and gateways, providing accurate reward sig-
                         $ lymon1004@163.com (Y. Liu); liuxj@bjut.edu.cn (X. Liu);                                                         nals for the generation of multi-step and multi-host
                         759301040@qq.com (X. Yu)                                                                                          attack paths.
                          0009-0009-6580-9840 (Y. Liu); 0000-0002-0666-4102 (X. Liu)
                                     © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                      Attribution 4.0 International (CC BY 4.0).                                                          • The SAC reinforcement learning algorithm, suitable for

        complex discrete action or even continuous action         and there is still much room for improvement in the con-
        spaces, is introduced into the penetration testing pro-        vergence speed and scalability of the algorithm. Based on
        cess, and the improved intelligent penetration test-      the above research, we further study penetration testing in
        ing agent (ASAC) can cope with the complex and            complex, large-scale scenarios. In this paper, the penetration
        changeable network environment and the exponen-           attack path is modeled from the perspective of multi-step
        tial growth of the action space.                          attack, and it is combined with MDP. On this basis, an in-
                                                                  telligent decision method for complex computing network
                                                                  attacks is proposed.
2. Related work
Traditional penetration testing (PT) relies on manual meth-       3. Methodology
ods, which become impractical as the systems grow in size
and complexity. By simulating the attack strategies of real        3.1. Markov Decision Model
attackers, automated penetration testing technology uses
various algorithm models to automate the penetration of           Figure 1 shows a typical penetration testing scenario, which
target networks, significantly reducing test cost and improv-     we use as a case study to illustrate the issues discussed in this
ing penetration efficiency [1] [2]. Automated penetration         article. The target network consists of the extranet, DMZ,
testing is an important application of artificial intelligence    and intranet subnets. The subnets contain user nodes and
technology in the field of network security [3]. Zang et al.      data nodes (sensitive nodes). The function of the intelligent
[4] summarized the current research progress of attack path       penetration agent is to determine the attack target of each
discovery in automated penetration testing, and proposed          step according to the state observation to reach the final
future research directions. Modeling the attack environ-          sensitive node and plan the optimal scheme.
ment through a Markov process can effectively help describe
and calculate the penetration test path. Literature [5] mod-
eled the environment as a Markov decision process diagram
based on the attack graph, and used a value iteration algo-
rithm to find the optimal penetration path. This approach
can help penetration testers find the most effective attack
path, thereby improving the efficiency and success rate of
penetration testing. Literature [6] proposes an automatic
attack planning algorithm (NIG-AP) based on network in-
formation gain, which formalizes penetration testing into a
Markov decision process, finds potential attack paths from
the attacker’s perspective, and uses network information to
optimize attack strategies. Zhou et al. [7] proposed an at-
tack path planning algorithm based on network information
gain, which formalized penetration testing into a Markov de-
cision process, used network information to obtain rewards,
and guided agents to choose the best response actions. The
method based on reinforcement learning, which can simu-
late the uncertainty of offense and defense in the real world
by designing the probability of success of action execution,
is an important research direction in this field. Schwartz
et al. [8] designed a lightweight network attack simulator        Figure 1: Example of Penetration Testing
NASim, which provides a benchmark platform for network
attack defense simulation test. They verify the effectiveness
                                                                     The Markov Decision Process (MDP) is commonly mod-
of basic reinforcement learning algorithms, such as Deep
                                                                  eled as a quadruple ⟨𝑆, 𝐴, 𝑅, 𝑃 ⟩, where:
Q-learning Network(DQN), in the application of penetra-
tion path discovery. In order to improve the convergence               • 𝑆 represents the set of penetration states observed
speed of DQN algorithm in path discovery problems, Zhou                  by the agent, such as network topology, host infor-
Shicheng et al. [9] proposed an improved reinforcement                   mation, and vulnerability details. These states cor-
learning algorithm, noise-double-dueling DQN. Hoang Viet                 respond to various stages in the penetration testing
Nguyen et al. [10] introduced an A2C algorithm with dual                phase, where information regarding the net-
agent architecture, which is responsible for path discovery              sequence:
and host utilization respectively. Zeng Qingwei et al. [11]                  – a) Information Gathering: The initial
suggested using hierarchical reinforcement learning algo-                      phase, where information regarding the net-
rithm to solve the problem of path discovery and host utiliza-                 work, systems, and potential vulnerabilities
tion being handled separately. In summary, current research                    is collected.
on the discovery of intelligent penetration test paths is still
                                                                             – b) Vulnerability Assessment: The process
in the preliminary stage. The advantage of MDP model is
                                                                               of identifying and evaluating weaknesses or
that it can model the uncertainty in the process of penetra-
                                                                               vulnerabilities within the network and sys-
tion testing well, but it also increases the computational
                                                                               tems.
complexity, making it difficult to apply to large-scale network
                                                                             – c) Exploitation: The stage where discovered
scenarios. The method based on reinforcement learning is
                                                                               vulnerabilities are exploited to gain unautho-
only experimentally verified in simple network scenarios,
                                                                               rized access to systems.
           – d) Privilege Escalation: The process of ele-                The damage to the network is given by (4):
              vating access privileges once entry has been
              gained, allowing deeper penetration into the                      $W = \max(W) = \sum_{i<j} (x_i + v_i) \cdot p_i$                    (4)
              network.
           – e) Maintaining Access: Ensuring persistent
              access to the compromised system.                          The attack reward function in the current state is then (5):
           – f) Reporting: The final phase, in which
              findings are documented, including discov-                 $R = W - C = \sum_{i<j} \{(x_i + v_i) \cdot p_i - (a_i + n_i) \cdot s_i\}$          (5)
              ered vulnerabilities and the level of access
              achieved.                                                  𝑃 is the breach probability of a node, and the attack threat
     • 𝐴 denotes the set of possible attack actions, cor-                degree is related to the corresponding service vulnerability
       responding to the aforementioned stages. These                    scores of different nodes in different networks.
       actions include network scanning, host scanning,
       vulnerability exploitation, privilege escalation, and             3.3. Improved Soft Actor-Critic Algorithm
       others available to the agent.
     • 𝑅 is the reward function 𝑅(𝑠), which assigns a re-                Reinforcement learning algorithms can be categorized into
       ward based on the different penetration states. For               three main types: value-based, policy-based, and actor-
       example, gaining the highest level of permission on               critic-based. SAC is an off-policy reinforcement learning
       a sensitive host might yield a reward of 100, while               algorithm that combines maximum entropy learning with
       breaching an Intranet subnet host could result in a               the actor-critic framework. It learns a policy network to
       reward of 1, and no reward is given for breaching a               select optimal penetration actions while estimating state
       DMZ host.                                                         and action values, thereby maximizing the policy’s entropy
     • 𝑃 represents the state transition function                        to encourage exploration and diversification in attack path
       𝑃 (𝑠, 𝑎, 𝑠′ ) = 𝑃 𝑟(𝑠′ |𝑠, 𝑎), which defines the                  selection. Currently, SAC is an efficient model-free rein-
       probability of transitioning from one state to                    forcement learning algorithm capable of learning stochastic
       another after performing a given action. This is                  policies, achieving state-of-the-art results in many standard
       typically associated with the success rate of the                 environments. Figure 2 shows the architecture of the im-
        attack actions.                                                   proved Attack Soft Actor-Critic (labelled ASAC) algorithm.
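
   As a concrete illustration of the reward definition above and of the discounted objective introduced in equation (1) below, the
following Python sketch (our own illustrative example, not the authors' implementation) maps penetration states to the example
reward values and accumulates them over one sampled trajectory.

# Minimal sketch of the reward R(s) and the discounted return of equation (1).
# Reward values follow the example in Section 3.1; the trajectory is hypothetical.

def state_reward(state: str) -> float:
    """R(s): root on a sensitive host = 100, Intranet host breach = 1, DMZ host breach = 0."""
    rewards = {"sensitive_root": 100.0, "intranet_host": 1.0, "dmz_host": 0.0}
    return rewards.get(state, 0.0)

def discounted_return(states, gamma=0.9):
    """G(s0) = sum_t gamma^t * R(s_{t+1}) along one sampled penetration trajectory."""
    return sum((gamma ** t) * state_reward(s) for t, s in enumerate(states[1:]))

# Hypothetical episode: breach a DMZ host, pivot to the Intranet, then root a sensitive host.
trajectory = ["start", "dmz_host", "intranet_host", "sensitive_root"]
print(discounted_return(trajectory))  # 0.0*1 + 1.0*0.9 + 100.0*0.81 = 81.9

The expectation over the transition probabilities $P(s_t, \pi(s_t), s_{t+1})$ in equation (1) is approximated in practice by
averaging such sampled trajectories.
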
                                                                         The ASAC algorithm consists of a policy network and four
   The objective of the MDP is to select the optimal policy              value function networks. The policy network acts as the ac-
𝑎𝑡 = 𝜋(𝑠𝑡 ) that maximizes the long-term cumulative re-                  tor, outputting attack actions toward the environment, while
ward 𝐺(𝑠0 ) for the current state, as expressed in equation              the four value networks evaluate the policy network. Since
(1).                                                                     it is a model-free algorithm, SAC uses an experience replay
                                                                         buffer to store all actions and environmental feedback data,
                                                                         and updates the networks through random sampling. Given
 $G(s_0) = \mathbb{E}\left[ \sum_{t=0}^{T-1} \gamma^{t} R(s_{t+1}) \, P(s_t, \pi(s_t), s_{t+1}) \right]$        (1)   that actions in reinforcement learning are often highly cor-
                                                                               related, this approach allows the neural network to achieve
   where 𝛾 ∈ (0, 1) denotes the discount factor, used to                  more effective training. Different from other RL algorithms,
balance the importance of current rewards against future                 in order to encourage exploration, the concept of entropy
rewards.                                                                  is added in the SAC algorithm, and entropy regularization
                                                                          increases the exploration degree of the reinforcement
                                                                          learning algorithm. The greater 𝛼 is, the stronger the exploration is,
3.2. Attack cost quantification model                                    which helps to accelerate the subsequent strategy learning
Designing a quantitative attack cost model can comprehen-                and reduce the possibility of the strategy falling into a poor
sively consider the investment and return of attack behavior             local optimum.
from multiple dimensions, so as to provide a more objective
basis for network security evaluation. For the MDP process               3.3.1. Policy network design
discussed above, the penetration test adopts multiple attack
                                                                         It can be seen that the input of the policy network Actor is
modes to penetrate the network. Assuming that the threat
                                                                          the network status $s_t$, the output is the action policy $\pi(a_t|s_t)$,
of a certain type of attack is 𝑋, the launching cost of such             and the update loss of its neural network is:
an attack is 𝐴, the attack cost of node i is 𝑁 , and the attack
value is 𝑉 , then the attack cost of attacking a certain node 𝑗          MSELoss $= -\frac{1}{|\mathcal{B}|} \sum_{(s_t,a_t,r_{t+1},s_{t+1})\in\mathcal{B}} \mathbb{E}_{a'_t\sim\pi(\cdot|s_t;\theta)} \big[ q_0(s_t,a'_t) - \alpha \ln \pi(a'_t|s_t;\theta) \big]$   (6)
can be obtained. Suppose that on the optimal attack path, the
state of node 𝑖 relative to the attacker is 𝑆. After an attack
fails, the network will detect it, thus increasing the attack            $\mathbb{E}_{a'_t\sim\pi(\cdot|s_t;\theta)} \big[ q_0(s_t,a'_t;w^{(0)}) - \alpha \ln \pi(a'_t|s_t;\theta) \big]$
cost. LC (limit cost) is a given cost budget: it is assumed                  $= \sum_{a'_t\in\mathcal{A}(s_t)} \pi(a'_t|s_t;\theta) \big[ q_0(s_t,a'_t;w^{(0)}) - \alpha \ln \pi(a'_t|s_t;\theta) \big]$   (7)
that the resources held by the attacker can consume at most
LC, which helps the intelligent agent accelerate convergence
during the attack search over the network.

          $C = \min(C) = \sum_{i<j} (a_i + n_i) \cdot s_i$              (2)

and it satisfies the inequality (3):                                     You can observe in Equation (7) that the three ele-
                        $C \le LC$                                   (3)   ments in the equation — $\pi(a'_t|s_t;\theta)$, $q_0(s_t,a'_t;w^{(0)})$,
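
As a numerical illustration of equations (2)-(5) (the per-node values below are assumptions of ours, not taken from the paper),
the cost C, the damage W, and the reward R = W - C can be computed and checked against the budget LC as follows:

# Sketch of the attack cost/benefit quantification in equations (2)-(5).
# The per-node numbers are illustrative assumptions, not values from the paper.

nodes = [
    # a: attack launching cost, n: node attack cost, s: state/selection factor,
    # x: attack threat value, v: node value, p: breach probability
    {"a": 2.0, "n": 1.0, "s": 1.0, "x": 3.0, "v": 5.0, "p": 0.9},
    {"a": 4.0, "n": 2.0, "s": 1.0, "x": 6.0, "v": 10.0, "p": 0.6},
]

LC = 20.0  # limit cost: the maximum resources the attacker can consume

C = sum((nd["a"] + nd["n"]) * nd["s"] for nd in nodes)          # equation (2)
W = sum((nd["x"] + nd["v"]) * nd["p"] for nd in nodes)          # equation (4)
R = W - C                                                       # equation (5)

assert C <= LC, "attack path exceeds the attacker's resource budget, inequality (3)"
print(f"cost C = {C}, damage W = {W}, reward R = {R}")
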
                                                                                              which can be written as:

                                                                                           $U_t^{(v)} = \sum_{a'_t \in \mathcal{A}(s_t)} \pi(a'_t|s_t;\theta) \Big[ \min_{i=0,1} q_i(s_t, a'_t; w^{(i)}) - \alpha \ln \pi(a'_t|s_t;\theta) \Big]$

                                                                                              We observe that $\pi(a'_t|s_t;\theta)$, $\min_{i=0,1} q_i(s_t, a'_t; w^{(i)})$,
                                                                                              and $\ln \pi(a'_t|s_t;\theta)$ in these terms perfectly match the loss cal-
                                                                                              culation described in Figure 2.
                                                                                            The output of the V critic network is used as a predicted
                                                                                         value, and finally, MSE loss is applied as the loss function,
                                                                                         training the V neural network.
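
                                                                                              To illustrate how these value targets can be computed in
                                                                                              practice, the following PyTorch sketch (illustrative only; layer
                                                                                              sizes, batch shapes, and variable names are our assumptions,
                                                                                              not the authors' code) forms the Q-critic target
                                                                                              $U_t^{(q)} = r_t + \gamma V(s_{t+1})$ of Section 3.3.2 and the
                                                                                              entropy-regularized V-critic target for a discrete action space,
                                                                                              training both critics with MSE losses.

# Illustrative PyTorch sketch of the Q- and V-critic losses (Sections 3.3.2-3.3.3).
# Layer sizes and variable names are assumptions for the example only.
import torch
import torch.nn.functional as F

state_dim, n_actions, alpha, gamma = 8, 4, 0.2, 0.9

q0 = torch.nn.Linear(state_dim, n_actions)   # Q-critic 0: q_0(s, .)
q1 = torch.nn.Linear(state_dim, n_actions)   # Q-critic 1: q_1(s, .)
v_net = torch.nn.Linear(state_dim, 1)        # V-critic
policy = torch.nn.Linear(state_dim, n_actions)

s, a, r, s_next = (torch.randn(32, state_dim), torch.randint(0, n_actions, (32,)),
                   torch.randn(32), torch.randn(32, state_dim))

# Q-critic target U^(q)_t = r_t + gamma * V(s_{t+1}); no gradient flows through the target.
with torch.no_grad():
    u_q = r + gamma * v_net(s_next).squeeze(-1)
q_loss = F.mse_loss(q0(s).gather(1, a.unsqueeze(1)).squeeze(1), u_q) \
       + F.mse_loss(q1(s).gather(1, a.unsqueeze(1)).squeeze(1), u_q)

# V-critic target U^(v)_t = E_{a'~pi}[ min_i q_i(s, a') - alpha * ln pi(a'|s) ].
with torch.no_grad():
    probs = F.softmax(policy(s), dim=-1)
    min_q = torch.min(q0(s), q1(s))
    u_v = (probs * (min_q - alpha * torch.log(probs + 1e-8))).sum(dim=-1)
v_loss = F.mse_loss(v_net(s).squeeze(-1), u_v)
print(q_loss.item(), v_loss.item())
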

                                                                                         3.3.4. Maximum entropy reinforcement learning
                                                                                          As we mentioned earlier, entropy represents a measure of
                                                                                         the degree of randomness of a random variable. Specif-
                                                                                         ically, if 𝑋 is a random variable and its probability den-
                                                                                         sity function is 𝑝, then its entropy 𝐻(𝑋) is defined as
                                                                                         𝐻(𝑋) = E𝑥∼𝑝 [− log 𝑝(𝑥)].
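
                                                                                          As a small numerical illustration of this definition, the
                                                                                          entropy of a discrete action distribution can be computed
                                                                                          directly:

# Entropy H(X) = E_{x~p}[-log p(x)] of a discrete action distribution (illustrative).
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 actions: log(4) ~ 1.386
print(entropy([0.97, 0.01, 0.01, 0.01]))  # near-deterministic policy: ~ 0.168
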

Figure 2: Architecture of ASAC                                                           3.3.5. Regional Experience Replay Buffer
                                                                                         In a large-scale network environment, direct use of a sin-
                                                                                         gle experience playback pool may cause the experiences
and $\ln \pi(a'_t|s_t;\theta)$ — are fully aligned with the loss                          of different subnets or nodes to interfere with each other,
function depicted in the graph. It is important to note that                             thus reducing training efficiency. To solve this problem, the
$q_0(s_t, a'_t; w^{(0)})$ can be replaced by $q_1(s_t, a'_t; w^{(1)})$, since             experience playback pool can be divided into multiple re-
both Q-critic networks function equivalently. Based on the                               gional playback pools (each region corresponds to a subnet
                                                                                          or node) to store and utilize the experience in a more targeted way.
                                                                                         or node) to store and utilize the experience more targeted.
idea of Double DQN, SAC uses two Critic networks, but each
                                                                                            Divide the experience pool: The experience pool is di-
time a Critic network is used, it picks a network with a small
                                                                                         vided into multiple subpools based on the network topology.
value, thereby alleviating the problem of overestimation.
                                                                                         Each subpool is dedicated to storing experiences from a
   The symbol ℬ represents the experience buffer, meaning
                                                                                         particular subnet or node. This division ensures that the
that when calculating the Loss, you need to take the average
                                                                                         experience within each region is more homogeneous and
of the samples drawn from the buffer. This ensures that
                                                                                         avoids the mixing of different regional experiences.
the expected average meaningfully represents the sample’s
                                                                                            Intra-zone sampling: At each policy update, SAC agents
overall outcome.
                                                                                         can sample experiences from the playback pool associated
   Here, 𝛼 is the entropy coefficient, which controls the
                                                                                         with the current zone of operation. For example, if the
importance of the entropy term ln 𝜋 (𝑎𝑡+1 |𝑠𝑡 ; 𝜃), and its
                                                                                         agent is currently in subnet A, it will preferentially sample
significance increases as 𝛼 increases.
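
   To make the actor update of equations (6)-(7) concrete for a discrete action space, the following sketch (again an illustration
under assumed shapes and names, not the paper's implementation) evaluates the expectation in equation (7) exactly by summing
over the action set:

# Illustrative PyTorch sketch of the discrete actor loss in equations (6)-(7):
# maximize E_{a'~pi}[q0(s,a') - alpha*ln pi(a'|s)], i.e. minimize its negative,
# averaged over a mini-batch B. Names and sizes are assumptions for the example.
import torch
import torch.nn.functional as F

state_dim, n_actions, alpha = 8, 4, 0.2
policy = torch.nn.Linear(state_dim, n_actions)     # actor: pi(.|s; theta)
q0 = torch.nn.Linear(state_dim, n_actions)         # one of the two Q-critics
q1 = torch.nn.Linear(state_dim, n_actions)

s = torch.randn(32, state_dim)                     # states sampled from the buffer B

log_pi = F.log_softmax(policy(s), dim=-1)          # ln pi(a|s)
pi = log_pi.exp()
with torch.no_grad():                              # critics are held fixed during the actor update
    q_min = torch.min(q0(s), q1(s))                # min(q0, q1) curbs overestimation

# Expectation over the discrete action set, as in equation (7), then the batch mean.
actor_loss = -(pi * (q_min - alpha * log_pi)).sum(dim=-1).mean()
actor_loss.backward()                              # gradients flow into the policy network only
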
                                                                                         from subnet A’s experience pool. This targeted sampling
                                                                                         method can improve the learning efficiency of the model,
3.3.2. Q Critic network design                                                           because the experience of sampling is more relevant and
Based on the optimal Bellman equation, we use $U_t^{(q)} = r_t + \gamma V(s_{t+1})$
as the true value estimate for the state $s_t$,                                           can optimize the strategy within the region faster.
while the 𝑞𝑖 (𝑠𝑡 , 𝑎𝑡 ) value (where 𝑖 = 0, 1) is used as the                            of experience from different regions can be sampled to en-
predicted value estimate for state 𝑠𝑡 with the actual action                             hance the model’s adaptability to the global network. This
𝑎𝑡 . Finally, the MSE loss is used as the loss function to train                         approach balances the relationship between regional focus
the neural networks 𝑄0 and 𝑄1 .                                                          and global exploration, so that the model can not only focus
   Note that using MSE loss implies taking the average of                                on the strategy optimization of a specific region, but also
the data sampled from the experience buffer (denoted as ℬ)                               understand the dynamics of other regions and enhance the
across a batch, as follows:                                                              generalization ability.
                                                                                            Mitigate non-stationarity issues: In complex network
                                                                                         environments, the dynamics of subnets may not be synchro-
 $\text{Loss} = \frac{1}{|\mathcal{B}|} \sum_{(s_t,a_t,r_t,s_{t+1})\in\mathcal{B}} \big[ q_i(s_t, a_t; w^{(i)}) - U_t^{(q)} \big]^2$                        while others are relatively stable. By storing the experi-
                                                                                         while others are relatively stable. By storing the experi-
                                                                                         ence pool in different regions, the negative impact of non-
                                                                                         stationarity on the model can be prevented and the stability
3.3.3. V Critic network design
                                                                                         of training can be effectively maintained.
Using the following equation with entropy for state value
estimation, the V critic network outputs the true value:                                 3.3.6. Prioritized Experience Replay

                                                                                          In reinforcement learning, not all experiences have the same
                                                                                          impact on policy improvement. Prioritized experience re-
 $U_t^{(v)} = \mathbb{E}_{a'_t \sim \pi(\cdot|s_t;\theta)}\Big[ \min_{i=0,1} q_i(s_t, a'_t; w^{(i)}) - \alpha \ln \pi(a'_t|s_t;\theta) \Big]$
                                                                                          play selects experiences that contribute more significantly
to policy improvement, helping the SAC algorithm make            Algorithm 1 Framework of ASAC Algorithm
efficient use of critical experiences and accelerate learning       Initialize Critic network parameters 𝜔1 , 𝜔2 and Actor
progress.                                                           network parameters 𝜃
                                                                    Initialize target network parameters 𝜔1− , 𝜔2−
    1. Priority Calculation: Each experience is assigned            Initialize regional experience replay buffers
       a priority, typically calculated based on the Temporal       ℛ1 , ℛ2 , . . . , ℛ𝑁
       Difference (TD) error:                                       Initialize temperature parameter 𝛼
                                                                    for each episode do
               $\delta = \left| r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right|$
                                                                       Sample initial state 𝑠1 , determine the current region
                                                                       for each time step 𝑡 = 1 → 𝑇 do
       The greater the TD error, the higher the potential                 Sample action 𝑎𝑡 ∼ 𝜋𝜃 (·|𝑠𝑡 ) from the policy net-
       for policy improvement, and therefore, the higher                  work
       the priority.                                                      Execute action 𝑎𝑡 , observe reward 𝑟𝑡 and next state
    2. Experience Sampling: During sampling, experi-                      𝑠𝑡+1
       ences with higher priority are more likely to be cho-              Store (𝑠𝑡 , 𝑎𝑡 , 𝑟𝑡 , 𝑠𝑡+1 ) in the corresponding re-
       sen. This can be achieved by probability sampling,                 gional replay buffer ℛ𝑖
       where the sampling probability 𝑃 (𝑖) is set propor-                for training steps 𝑘 = 1 → 𝐾 do
       tional to the priority level 𝑝𝑖 . For example, the sam-               Sample a mini-batch {(𝑠𝑖 , 𝑎𝑖 , 𝑟𝑖 , 𝑠𝑖+1 )}𝑁  𝑖=1
       pling probability 𝑃 (𝑖) can be defined as:                            from all regional replay buffers using prioritized
                                                                             sampling
                         $P(i) = \dfrac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$        Calculate the TD error 𝛿𝑖 = 𝑟𝑖 +
                                                                                   𝛾 min𝑗=1,2 𝑄𝜔𝑗− (𝑠𝑖+1 , 𝑎𝑖+1 ) − 𝑄𝜔𝑗 (𝑠𝑖 , 𝑎𝑖 )
       where 𝑝𝑖 is the priority of experience 𝑖, and 𝛼 con-                  Update priority for each sample based on 𝛿𝑖 and
       trols the level of prioritization. When 𝛼 = 1, sam-                   adjust sampling probabilities
       pling is fully prioritized by priority; when 𝛼 = 0,                   Calculate target value 𝑦𝑖          =      𝑟𝑖 +
       sampling is uniform.                                                  𝛾 min𝑗=1,2 𝑄𝜔− (𝑠𝑖+1 , 𝑎𝑖+1 )
                                                                                             𝑗
    3. Importance Sampling Weight Correction: To                            Update Critic networks: for 𝑗 = 1, 2, minimize
       correct for sampling bias, higher-priority experi-                   the loss function
       ences are given lower weights in the gradient calcu-                 Update Actor network using reparameterization
       lation. The importance sampling weight 𝑤(𝑖) can                      trick to sample action 𝑎
                                                                                                   ˜𝑖
       be used to balance this bias:                                        Update temperature parameter 𝛼 to minimize
                      $w(i) = \left( \dfrac{1}{N \cdot P(i)} \right)^{\beta}$     entropy objective
                                                                                end for
                                                                                Update target network parameters:
       where 𝑁 is the size of the experience buffer, and 𝛽
       is a parameter that adjusts the degree of importance-                     𝜔𝑗− ← 𝜏 𝜔𝑗 + (1 − 𝜏 )𝜔𝑗− ,    𝑗 = 1, 2
       sampling correction. When 𝛽 = 1, the correction is
                                                                       end for
       fully applied. Typically, 𝛽 starts from 0 and gradually
                                                                     end for
       increases to avoid instability due to high weights
       early in training.
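
       The three steps above can be combined in a few lines; the sketch below is
       illustrative, with made-up TD errors and variable names of our own.

# Sketch of prioritized sampling: probabilities P(i) from TD-error priorities and
# importance-sampling weights w(i). Numbers are illustrative, not from the paper.
import numpy as np

td_errors = np.array([0.1, 2.0, 0.5, 1.2])      # |delta| for four stored experiences
alpha_p, beta = 0.8, 0.4                        # prioritization and IS-correction exponents

priorities = np.abs(td_errors) + 1e-6           # p_i (a small epsilon keeps every sample reachable)
probs = priorities ** alpha_p / np.sum(priorities ** alpha_p)        # P(i)

idx = np.random.choice(len(probs), size=2, replace=False, p=probs)   # prioritized draw
weights = (len(probs) * probs[idx]) ** (-beta)  # w(i) = (1 / (N * P(i)))^beta
weights /= weights.max()                        # common normalization for stability

print(idx, weights)
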

    The flow of the improved ASAC (Attacker Soft Actor            discover all hosts in the subnet; (2) operating system scan
Critic) algorithm is shown in Algorithm 1.                             (os_scan) to obtain the operating system type of the target
    Line 1-4: Initialize the Critic, Actor, and target network   host; (3) service scan (service_scan) to obtain the service
parameters, along with regional experience replay buffers        type of the target host; (4) process scan (process_scan) to
and temperature parameter. Line 5: Start the loop for each       obtain information about processes on the target host. By
episode, sampling the initial state and determining the re-      performing the scan actions, the PT Agent can acquire the
gion. Line 7-10: At each time step, sample an action, execute    corresponding host information as an observed state. The
it, observe the outcome, and store the experience in the re-     VE actions and PE actions must be performed based on
spective regional replay buffer. Line 11-19: Perform training    the specific requirements. In addition, a certain probability
steps by sampling a mini-batch with prioritized experience       of success is set according to the Common Vulnerability
replay, calculating TD errors and priorities, updating Critic    Scoring System (CVSS) [12] to simulate the uncertainty of
and Actor networks, and adjusting temperature. Line 20:          attacks in reality. By configuring different VE actions and PE
Update the target networks with a soft update mechanism.         actions, the PT agent is modelled with different attack capa-
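   For orientation, a heavily condensed but runnable skeleton of this loop is sketched below; the environment, agent, and buffer
are trivial stand-ins of ours, not components defined by the paper.

# Condensed, runnable skeleton of the training loop in Algorithm 1. The environment,
# agent, and buffer below are trivial stand-ins (assumptions), not the paper's code.
import random

class StubEnv:
    def reset(self): return 0                                  # initial state s1
    def step(self, a): return random.randint(0, 3), random.random(), random.random() < 0.1

class StubAgent:
    def act(self, s): return random.randint(0, 3)              # a_t ~ pi_theta(.|s_t)
    def update(self, batch): pass                              # critic/actor/alpha updates
    def soft_update_targets(self, tau=5e-3): pass              # w_j^- <- tau*w_j + (1-tau)*w_j^-

env, agent, buffer = StubEnv(), StubAgent(), []

for episode in range(3):                                       # "for each episode do"
    s = env.reset()
    for t in range(50):                                        # "for each time step t = 1..T do"
        a = agent.act(s)
        s_next, r, done = env.step(a)
        buffer.append((s, a, r, s_next))                       # regional/prioritized storage in full ASAC
        if len(buffer) >= 8:
            agent.update(random.sample(buffer, 8))             # prioritized sampling in the full method
        s = s_next
        if done:
            break
    agent.soft_update_targets()                                # "Update target network parameters"
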
                                                                 bilities, for example as shown in Table 1. The results of the
                                                                 attack actions are simulated by updating the compromised
4. Experiment and Discussion                                     state information of the target host.
                                                                    The network topology diagram of the simulation experi-
4.1. Network configuration                                       ment in this paper is shown in Figure 3: there are 6 subnets,
                                                                 with firewalls and access protocols set between subnets;
To simulate a realistic observation-based penetration pro-
                                                                 the yellow host is the sensitive host in the network, the
cess, we assume that both topology and host information
                                                                 attacker has mastered the two available nodes (1,0), (6,0)
must be obtained through scanning or feedback from attack
                                                                 and 4 scanning means, VE, PE and other actions to finally
actions. Therefore, the Information Gathering step is car-
                                                                 obtain the sensitive node permissions and related sensitive
ried out through 4 steps: (1) subnet scan (subnet_scan) to
    Table 1
    Attack Actions

                             Type    Name         OS        Service   Process   Prob   Privileges
                             VE      E_SSH        Linux     SSH       —         0.9    User
                             VE      E_FTP        Windows   FTP       —         0.6    User
                             VE      E_HTTP       —         HTTP      —         0.9    User
                             PE      P_Tomcat     Linux     —         Tomcat    1      Root
                             PE      P_Daclsvc    Windows   —         Daclsvc   1      Root
                             (OS, Service, and Process are prerequisites; Prob and Privileges are results.)
                             Attack capabilities configuration in Figure 3
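
To show how such actions can be simulated, the sketch below mirrors Table 1 and applies an action only when its prerequisites
hold, succeeding with the listed probability; the host dictionary and function name are our own illustrative assumptions.

# Sketch of simulating the attack actions in Table 1 with a success probability
# (e.g. derived from CVSS). Action data mirrors Table 1; the host bookkeeping is
# an illustrative assumption.
import random

ACTIONS = {
    "E_SSH":     {"type": "VE", "os": "Linux",   "service": "ssh",  "prob": 0.9, "priv": "User"},
    "E_FTP":     {"type": "VE", "os": "Windows", "service": "ftp",  "prob": 0.6, "priv": "User"},
    "E_HTTP":    {"type": "VE", "os": None,      "service": "http", "prob": 0.9, "priv": "User"},
    "P_Tomcat":  {"type": "PE", "os": "Linux",   "process": "tomcat",  "prob": 1.0, "priv": "Root"},
    "P_Daclsvc": {"type": "PE", "os": "Windows", "process": "daclsvc", "prob": 1.0, "priv": "Root"},
}

def try_action(host, name, rng=random.random):
    """Apply an action if its prerequisites hold; success is Bernoulli(prob)."""
    spec = ACTIONS[name]
    if spec["os"] is not None and spec["os"] != host["os"]:
        return False
    if spec.get("service") and spec["service"] not in host["services"]:
        return False
    if spec.get("process") and spec["process"] not in host["processes"]:
        return False
    if rng() < spec["prob"]:
        host["privilege"] = spec["priv"]       # update the compromised state of the host
        return True
    return False

host_2_0 = {"os": "Linux", "services": ["ssh", "ftp"], "processes": ["tomcat"], "privilege": None}
print(try_action(host_2_0, "E_SSH"), host_2_0["privilege"])
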


information, the vulnerability information of the network
is shown in the Figure 3. The host configuration is shown
in Table 2.




                                                                   Figure 4: Tests with different parameter combinations


Figure 3: Network topology sample
                                                                   of hyperparameters, including learning rate, discount fac-
                                                                   tor, batch size, and so on. Through many experiments, we
                                                                   identify a set of optimal hyperparameters and verify them
Table 2                                                            in the network environment.
Host Configurations                                                   Figure 4 shows the performance comparison of ASAC
                                                                   algorithm under different parameter Settings. The experi-
    Host          OS            Services       Processes
                                                                   mental results of each group are smoothed. The experimen-
    (1, 0)       Linux             http           —                tal results show that the proxy training obtained by group4
    (2, 0)       Linux           ssh, ftp       tomcat             group parameters cannot achieve balance, and basically each
    (3, 0)      Windows             ftp           —                training round cannot reach the optimal condition. But it
    (3, 1)       Linux             ssh            —                doesn’t converge yet. After multiple tuning, the optimal
    (4, 0)      Windows            http         daclsvc            parameter group group1 is found. After parameter tuning,
    (5, 0)       Linux           ftp, ssh         —                the ASAC algorithm shows good performance in the net-
    (5, 1)      Windows             ftp         daclsvc            work environment and can converge in about 300 rounds.
    (6, 0)       Linux             http         tomcat             It can be seen that the optimized algorithm has a significant
                                                                   improvement in convergence speed and final performance.

                                                                   4.2.2. RQ2: Compared with previous algorithms,
4.2. Tests and Analyses                                                   does the improved ASAC algorithm have better
                                                                          performance in the network environment?
To validate the performance of our model and decision
method, we focus our experiments on answering the follow-          In order to verify the performance advantages of the im-
ing three Research Questions (RQs):                                proved ASAC algorithm in the network environment, we
                                                                   compare the ASAC algorithm with several classical network
4.2.1. RQ1: Is the ASAC algorithm feasible in the network                optimization algorithms, including Q-learning and DQN
       environment after parameter tuning?                               (Deep Q-Network), which have been studied in intelligent
                                                                         penetration testing.
                                                                   Figure 5 shows the change of the cumulative reward value
In this part, we first perform detailed parameter tuning for       of each round of different algorithms with the number of
the ASAC algorithm, and a total of four groups of parame-          training rounds in the network environment. The learning
ters are set, with detailed data shown in Table 3. The goal of     goal of the agent is to learn to use fewer steps to obtain the
parameter tuning is to find an optimal set of hyperparame-         permissions of all sensitive hosts in the target network, so
ters to ensure the stability and effectiveness of the algorithm    as to obtain the reward value, so the reward value can be
in the network environment. We used a combination of grid          used to measure the level of the agent’s strategy. It can be
search and random search to explore different combinations         found from the Figure 5 that in the initial stage of training,
     Table 3
     Hyperparameter Tuning Table Group
                                  Hyperparameter            group1     group2      group3     group4
                               Actor Learning Rate (𝛼)       1e-3       1e-3        1e-3       1e-4
                              Critic Learning Rate (𝛼)       1e-3       1e-3        1e-3       1e-4
                                 Discount Factor (𝛾 )          0.9       0.95        0.95       0.95
                                Standard Batch Size           128         64          32         32
                            Prioritized Replay Batch Size     128         64          32         32
                               Target Update Rate (𝜏 )       5e-3       5e-2        5e-2       1e-3
                                  Replay Buffer Size          3e5        3e5         3e5        3e5
                            Prioritization Parameter (𝛼)       0.8        0.8        0.8        0.8
                                Regional Buffer Size         5000       5000        5000       5000
                                  Exploration Noise           0.05       0.05        0.1        0.1
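
      For reference, the tuned group1 setting can be collected into a single configuration object; the sketch below simply mirrors
      Table 3, with key names of our own choosing.

# group1, the best-performing hyperparameter set from Table 3, as a plain config dict.
# Key names are our own; values are copied from the table.
GROUP1_CONFIG = {
    "actor_lr": 1e-3,
    "critic_lr": 1e-3,
    "discount_gamma": 0.9,
    "batch_size": 128,
    "prioritized_batch_size": 128,
    "target_update_tau": 5e-3,
    "replay_buffer_size": int(3e5),
    "prioritization_alpha": 0.8,
    "regional_buffer_size": 5000,
    "exploration_noise": 0.05,
}
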



                                                                     and complexity. Table 4 shows five different scale network
                                                                     environment configurations used in the experiment.
                                                                        The experimental networks range in size from 8 to 64
                                                                      nodes and cover different scenarios from small LANs to large
                                                                     WAN networks. The main indicators we focus on are the
                                                                     score of ASAC algorithm under different network condi-
                                                                     tions, and the convergence speed.

                                                                     Table 4
                                                                     Network Environment Specifications
                                                                      Env Name           Nodes      Subnets      Sensitive Nodes
                                                                      Environment A        8           5              2*100
                                                                      Environment B        8           6              2*100
Figure 5: Different algorithms Tests                                  Environment C        9           6              2*100
                                                                      Environment D       32           8              5*100
                                                                      Environment E       40          10             10*100

the reward value that the agent can obtain in each turn is
                                                                        As can be seen from Figure 6, in the three heterogeneous
small, but with the progress of training, the reward value
                                                                     networks A, B and C with the same sensitive host and
keeps increasing, indicating that the agent gradually learns
                                                                     slightly different summary points and subnets, the ASAC
the strategy of obtaining the maximum reward value. In the
                                                                     algorithm agent has good robustness, and its average cu-
early exploration process, the cumulative reward value of
                                                                     mulative reward value can converge to the optimal value
ASAC algorithm per round is significantly lower than that
                                                                     within 400 rounds in the three similar scenarios.
of DQN and Q-learning algorithm. This is because in the
                                                                        As the number of subnets and hosts increases, the con-
early exploration process of ASAC algorithm, the agent has
                                                                     vergence rate of the algorithm slows down as the num-
no experience to rely on and can only use random strategies
                                                                     ber of hosts and actions that the agent needs to attempt
for exploration. It also wastes a lot of attack steps, but with
                                                                     in each turn increases rapidly. Figure 7 shows the perfor-
the progress of training, Q-learning cannot obtain better results
                                                                     mance of ASAC algorithm under different network scales.
due to the explosion of the Q table, and the convergence speed
                                                                     It can be seen that the performance of ASAC algorithm de-
and effect of DQN are not as good as that of ASAC algo-
                                                                     creases slightly with the increase of network scale, but it can
rithm. Finally, both ASAC and DQN algorithms converge
                                                                     still effectively deal with complex problems in large-scale
to a stable cumulative reward value within 600 rounds, and
                                                                     network environment. In the simulated network environ-
ASAC algorithm has the best performance and can converge
                                                                     ment of 40 hosts,ASAC algorithm can also converge to the
to the optimal value within 300 rounds.
                                                                     optimal solution. The experimental results show that the
   The experimental results show that the improved ASAC
                                                                      ASAC algorithm maintains good performance when the
algorithm has obvious advantages in the network environment,
                                                                      network size increases gradually. In addition, ASAC algo-
including convergence speed, decision quality, and resource
                                                                      rithm also performs well in resource utilization, and can
utilization. The ASAC algorithm outperforms the other
                                                                      maintain high resource utilization under different network
algorithms in all metrics, especially when dealing with
                                                                      sizes. When dealing with complex network topology and
large-scale networks and highly dynamic changes.
                                                                     high load, ASAC algorithm can make decisions effectively
                                                                     to ensure the stability and efficiency of the network.
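As a concrete illustration of how these per-round reward curves can be produced, the sketch below logs the cumulative reward per round and applies a simple moving-average convergence check. It is a minimal sketch, assuming a Gym-style simulated penetration environment: the names env, agent, reset(), step(), select_action(), and update() are placeholders and not the paper's implementation; any ASAC, DQN, or Q-learning agent exposing this interface could be plugged in.

from collections import deque

# Minimal logging loop (illustrative only). `env` and `agent` are assumed
# placeholders for the simulated network environment and the RL agent.
def train_and_log(env, agent, episodes=1000, window=50, tol=1.0):
    """Return per-round cumulative rewards and the first round at which the
    moving-average reward changes by less than `tol` between rounds."""
    rewards = []
    recent = deque(maxlen=window)          # sliding window of recent rewards
    prev_avg, converged_at = None, None
    for episode in range(episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.select_action(state)           # sample an attack action
            next_state, reward, done = env.step(action)   # execute one attack step
            agent.update(state, action, reward, next_state, done)
            state, total = next_state, total + reward
        rewards.append(total)
        recent.append(total)
        avg = sum(recent) / len(recent)
        if (converged_at is None and prev_avg is not None
                and len(recent) == window and abs(avg - prev_avg) < tol):
            converged_at = episode
        prev_avg = avg
    return rewards, converged_at

A moving-average window is used here because per-round rewards fluctuate strongly during the early random exploration phase, which would otherwise make the convergence round ill-defined.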

4.2.3. RQ3: Is the proposed ASAC scalable across different network sizes?

In order to verify the scalability of the ASAC algorithm, we conducted experiments in network environments of different scales. The experiments were carried out in a network simulation environment that simulates network topologies of different sizes.

Figure 6: Reward in similar network environment

   As can be seen from Figure 6, in the three heterogeneous networks A, B, and C, which contain the same sensitive host but slightly different summary points and subnets, the ASAC agent shows good robustness: its average cumulative reward converges to the optimal value within 400 rounds in all three similar scenarios.
   As the number of subnets and hosts increases, the convergence rate of the algorithm slows down, because the number of hosts and actions that the agent must attempt in each turn grows rapidly. Figure 7 shows the performance of the ASAC algorithm under different network scales. Its performance decreases slightly as the network scale increases, but it can still handle complex problems in large-scale network environments effectively. In the simulated network environment of 40 hosts, the ASAC algorithm still converges to the optimal solution. The experimental results show that the ASAC algorithm maintains good performance as the network size grows. In addition, the ASAC algorithm also performs well in resource utilization, maintaining high resource utilization under different network sizes. When dealing with complex network topologies and high load, the ASAC algorithm can make decisions effectively to ensure the stability and efficiency of the network.

Figure 7: Reward in complex network environment
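To make the scale comparison easier to reproduce, the sketch below shows one way the different-scale topologies could be parameterized by number of subnets and hosts per subnet, with one designated sensitive host. It is only an assumption-laden sketch: the class NetworkScenario, the per-host action count, and the "small" and "medium" sizes are illustrative; the "large" scenario is meant to echo the 40-host environment used above, and its 10-subnet breakdown is an assumption.

from dataclasses import dataclass
from itertools import product

@dataclass
class NetworkScenario:
    """Hypothetical description of a simulated topology for the scaling runs."""
    name: str
    num_subnets: int
    hosts_per_subnet: int
    sensitive_host: tuple = (0, 0)   # (subnet, host) address of the target

    def hosts(self):
        # Enumerate every (subnet, host) address in the topology.
        return list(product(range(self.num_subnets), range(self.hosts_per_subnet)))

    def action_space_size(self, actions_per_host=5):
        # Rough action-space size if each host admits a fixed number of
        # scan/exploit actions (the constant 5 is an assumption).
        return self.num_subnets * self.hosts_per_subnet * actions_per_host

# Scenarios of increasing scale; "small" and "medium" are illustrative, while
# "large" echoes the 40-host environment (10 x 4 is an assumed breakdown).
scenarios = [
    NetworkScenario("small",  num_subnets=3,  hosts_per_subnet=4),
    NetworkScenario("medium", num_subnets=6,  hosts_per_subnet=4),
    NetworkScenario("large",  num_subnets=10, hosts_per_subnet=4, sensitive_host=(9, 3)),
]
for s in scenarios:
    print(s.name, len(s.hosts()), "hosts,", s.action_space_size(), "candidate actions")

The printed counts make explicit that the number of candidate actions grows with the product of subnets and hosts, which is consistent with the slower convergence observed for the larger environments.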

5. Conclusion

Through the experimental analysis, we can draw the following conclusions. The experimental results validate the effectiveness and superiority of the ASAC algorithm, provide an effective tool for intelligent network penetration attack agents, and provide evaluation support for the security sub-characteristic "resistance" defined in the ISO/IEC 25000 SQuaRE product quality model. This offers strong support for future applications in the field of network optimization.
   Quantitative analysis reveals that the ASAC algorithm not only outperforms traditional methods like Q-learning and DQN but also demonstrates robust performance under varying network conditions, contributing to a more reliable
and secure network optimization solution. The ability to
adapt and maintain high security performance across dif-
ferent network scales and configurations provides strong
support for future applications in network security and op-
timization.
   However, the algorithm has certain limitations. One no-
table drawback is the lack of comparison between the ASAC
algorithm and other currently optimized attack agent algo-
rithms, which would provide a clearer benchmark for its rel-
ative performance. Despite its promising performance and
stability, the algorithm may still face convergence speed is-
sues in highly dynamic environments. Moreover, the param-
eter tuning process remains relatively complex, potentially
increasing the difficulty of practical implementation. Future
work should focus on developing more efficient adaptive
parameter adjustment mechanisms to improve the ASAC al-
gorithm’s adaptability and robustness in even more complex
and unpredictable environments.


References
 [1] Greco C, Fortino G, Crispo B, et al. AI-enabled IoT penetration testing: state-of-the-art and research challenges[J]. Enterprise Information Systems, 2023, 17(9): 2130014.
 [2] Ghanem M C, Chen T M, Nepomuceno E G. Hierarchical reinforcement learning for efficient and effective automated penetration testing of large networks[J]. Journal of Intelligent Information Systems, 2023, 60(2): 281-303.
 [3] Wang Y, Li Y, Xiong X, et al. DQfD-AIPT: An Intelligent Penetration Testing Framework Incorporating Expert Demonstration Data[J]. Security and Communication Networks, 2023, 2023(1): 5834434.
 [4] Zang Y C, Zhou T Y, Zhu J H, et al. Domain-Independent Intelligent Planning Technology and Its Application to Automated Penetration Testing Oriented Attack Path Discovery[J]. Journal of Electronics & Information Technology, 2020, 42(9): 2095-2107.
 [5] Ma Q, Liu Y, Wu X S, Qu Y, Wang B L, Liu H R. Optimal Penetration Path Discovery Based on Value Iterative Algorithm[J]. Computer Systems and Applications, 2023, 32(12): 197-204.
 [6] Kang H, Long M, Zhang C. Review on the Application of Automated Penetration Testing[J]. Journal of Cybersecurity, 2023, 1(2): 59-72.
 [7] Zhou T, Zang Y, Zhu J, et al. NIG-AP: a new method for automated penetration testing[J]. Frontiers of Information Technology & Electronic Engineering, 2019, 20(9): 1277-1288.
 [8] Schwartz J, Kurniawati H. Autonomous penetration testing using reinforcement learning[J]. arXiv preprint arXiv:1905.05965, 2019.
 [9] Zhou S, Liu J, Zhong X, et al. Intelligent Penetration Testing Path Discovery Based on Deep Reinforcement Learning[J]. Computer Science, 2021, 48(07): 40-46.
[10] Nguyen H V, Teerakanok S, Inomata A, et al. The Proposal of Double Agent Architecture using Actor-critic Algorithm for Penetration Testing[C]//ICISSP. 2021: 440-449.
[11] Zeng Q, Zhang G, Xing C, et al. Intelligent Attack Path Discovery Based on Hierarchical Reinforcement Learning[J]. Computer Science, 2023, 50(07): 308-316.
[12] CVSS v4.0 Specification Document. https://www.first.org/cvss/v4.0/specification-document. Available online: Feb 1, 2024.