<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing Exfiltration Path Analysis Using Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riddam Rishu</string-name>
          <email>rrishu@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshay Kakkar</string-name>
          <email>akshkakkar@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Wang</string-name>
          <email>chengwang@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdul Rahman</string-name>
          <email>abdulrahman@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher Redino</string-name>
          <email>credino@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dhruv Nandakumar</string-name>
          <email>dnandakumar@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tyler Cody</string-name>
          <email>tcody@vt.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan Clark</string-name>
          <email>ryanclark4@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Radke</string-name>
          <email>dradke@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Bowen</string-name>
          <email>edbowen@deloitte.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CAMLIS'23: Conference on Applied Machine Learning for Information Security</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Security Institute</institution>
          ,
          <addr-line>Virginia Tech</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Building on previous work using reinforcement learning (RL) focused on identification of exfiltration paths, this work expands the methodology to include protocol and payload considerations. The former approach to exfiltration path discovery, where reward and state are associated specifically with the determination of optimal paths, is presented with these additional realistic characteristics to account for nuances in adversarial behavior. The paths generated are enhanced by including communication payload and protocol in the Markov decision process (MDP) in order to more realistically emulate attributes of network-based exfiltration events. The proposed method will help emulate complex adversarial considerations such as the size of a payload being exported over time or the protocol on which it occurs, as is the case where threat actors steal data over long periods of time using system-native ports or protocols to avoid detection. As such, practitioners will be able to improve identification of expected adversary behavior under various payload and protocol assumptions more comprehensively.</p>
      </abstract>
      <kwd-group>
        <kwd>reinforcement learning</kwd>
        <kwd>exfiltration</kwd>
        <kwd>penetration testing</kwd>
        <kwd>cyber terrain</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Adversaries commonly exfiltrate traffic through the domain name system (DNS) to deter detection while obfuscating intent [8, 7, 6, 9, 5].</p>
      <p>The previous literature’s drawbacks are discussed in [1], and the topic of using RL for
conducting post-exploitation activities such as exfiltration is still under-studied. Previous work
[4], for example, employs ontological models of the agent with actions defined using common
software modules. While this may be useful in some capacity, it suffers from a lack of alignment to network
structure, path structure, and cyber terrain, thereby limiting its ability to anchor agents to the
real computer network. In addition, the output is not operationally interpretable in a way that security
operations center (SOC) and cyber analysts can act on [6, 9, 7, 5].</p>
      <p>This paper presents a framework for using RL methods for discovering exfiltration paths in
network models while accounting for attacker preferences in payload and protocol.
The key contributions of this paper are twofold:
1. An approach for modeling data exfiltration on networks that accounts for choices in
protocol with varying size of payload.
2. The implementation of RL-based algorithms for discovering exfiltration paths in network
models.</p>
      <p>The presented methodology is aligned with a focus on network structure and configuration,
path analysis, and cyber terrain. Its outcomes can be directly understood as paths through
networks, as is highlighted in a detailed discussion of the results. To support reproducibility,
the RL solution methods, experimental design, and network model are specified in great detail.</p>
      <p>The remainder of this work begins with a background on the use of RL for penetration testing
followed by an exploration of the methods for modeling defensive terrain and discovering
exfiltration paths. Then, the experimental design is described for evaluating the proposed
approach, followed by an analysis of the experimental results, and a discussion of the findings.
Lastly, this paper concludes with remarks on modeling decisions, a summary of the work, and
possible avenues of future research.</p>
    </sec>
    <sec id="sec-3">
      <title>2. RL and Penetration Testing</title>
      <sec id="sec-3-1">
        <title>2.1. Reinforcement Learning</title>
        <p>RL is a framework where an agent learns to optimize its behaviour by interacting with its
environment [10]. A Markov decision process (MDP) (S, A, P, R, γ) is often used to model the
environment, where S is the state space, A is the action space, P : S × A → Δ(S) is the transition
function, R : S × A × S → ℝ is the reward function, and γ ∈ (0, 1] is the discount factor,
which determines the present value of future rewards. The agent’s behavior is characterized by
its policy π, which is a probability distribution over actions given a state. For deterministic
policies, the action taken in state s can be denoted as π(s). At each time step t, the
agent observes a state s_t, takes an action a_t according to π(·|s_t), transitions to
a new state s_{t+1}, and receives a reward r_t = R(s_t, a_t, s_{t+1}). The cumulative discounted reward is
called the return and is defined as G_t = ∑_{k=0}^{∞} γ^k r_{t+k}. The RL agent aims to learn an optimal policy
π*, which maximizes the expected return from each state. RL algorithms can be categorized
into three groups: value function-based (or critic-only) methods, policy gradient (or actor-only)
methods, and actor-critic methods.</p>
        <p>Value function-based methods such as Q-learning [11] or deep Q-networks (DQN) [12] learn
optimal policies by first estimating the optimal action-value function Q*(s, a):
Q*(s, a) ≡ max_π Q^π(s, a) ≡ max_π E_π[G_t | s_t = s, a_t = a], (1)
which can be obtained by solving the Bellman equation:
Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a]. (2)
Then, an optimal policy π* is derived by selecting the action that yields the largest Q-value:
π*(s) = argmax_a Q*(s, a). (3)</p>
        <p>On the other hand, policy gradient approaches focus on directly parameterizing the policy
π(a|s; θ) and optimizing a performance measure J(θ), such as the expected return, via
gradient ascent. Such methods often suffer from high variance and may result in slow learning.
Thus, to reduce the variance, actor-critic methods use an estimate of the value function
V^π(s) ≡ E_π[G_t | s_t = s] as a baseline when estimating the policy gradient ∇_θ J(θ) [13]. The critic is
responsible for learning the value function while the actor updates policy parameters by using
the estimated policy gradient. In particular, the policy gradient can be estimated as
∇_θ J(θ) ≈ E[∇_θ log π(a_t | s_t; θ) Â_t], (4)
where Â_t = Q(s_t, a_t) − V(s_t) represents the advantage of taking action a_t at state s_t.</p>
        <p>Policy gradient methods are prone to performance collapse as a result of large policy updates,
which can be challenging to recover from because the agent will have been trained on the
experience produced by bad policies. To improve training stability, Proximal Policy Optimization
(PPO) [14] uses a clipped surrogate objective function:
ℒ(θ) = E_t[min(r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t)], (5)
where r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t) is the probability ratio of the new policy over the old policy.</p>
        <p>The advantage function Â_t is often estimated using generalized advantage estimation [15],
truncated after T steps:
Â_t = δ_t + (γλ)δ_{t+1} + ⋯ + (γλ)^{T−t+1} δ_{T−1}, (6)
where δ_t = r_t + γV(s_{t+1}) − V(s_t). (7)</p>
        <p>To support exploration, an entropy bonus cS[π](s) is often added to the objective function (5),
where c is a coefficient.</p>
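As a minimal illustrative sketch (not the authors' implementation), the clipped surrogate objective and the truncated generalized advantage estimates described above can be computed numerically as follows; all function names are ours:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t), averaged over a batch
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(ratio * advantage, clipped * advantage)))

def gae(rewards, values, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); `values` carries one
    # extra bootstrap entry V(s_T) beyond the last reward
    deltas = [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

The clipping keeps the update close to the old policy: a ratio of 2.0 with a positive advantage contributes only as if the ratio were 1 + ε.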
      </sec>
      <sec id="sec-3-2">
        <title>2.2. RL applications in penetration testing</title>
        <p>Deep RL has been applied to cybersecurity broadly [13], but only recently has it been employed
as a tool for penetration testing [16, 17, 18, 19, 20, 21, 22, 23]. There are a number of different
approaches, but most only consider privilege escalation on a target host as the learning task.
Gangupantulu et al. proposed using concepts of cyber terrain to help enrich task design and
reward shaping [23]. This concept spurred the development of several task-specific uses of RL
for penetration testing, including crown jewel analysis [24], discovering exfiltration paths [1],
and exposing surveillance detection routes [25].</p>
        <p>As with Gangupantulu et al. [24], the RL approach presented here solves a more complex
task and acts as a focused tool for cyber operators to increase the effectiveness of operator
workflow in penetration testing. RL for penetration testing has made frequent use of DQN
[17, 21, 22, 23, 24]. As an alternative, Nguyen et al. proposed an RL-based approach that makes
use of two agents: one for iteratively scanning the network to build a structural model and
another for exploiting the constructed model [26]. In this study, Nguyen et al.’s double-agent
architecture is combined with the PPO algorithm to train the RL agents.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methods</title>
      <p>In this section, we present the details of the exfiltration model, the protocol-based path selection
criteria, and the complete RL formulation. While the model incorporates several assumptions,
it is fundamentally based on a data-driven approach using scan data. This reliance on scan data
not only ensures empirical robustness but also permits iterative refinement, as newer or more
comprehensive data become available, to progressively approach a more accurate representation
of reality.</p>
      <sec id="sec-4-1">
        <title>3.1. Exfiltration Simulation Overview</title>
        <p>The approach proposed here expands on previous models for data exfiltration in that it can
model paths for different payload sizes. The exfiltration campaign is modeled based on three
tasks: (i) connection, (ii) path selection, and (iii) exfiltration. The attacker initially
attempts to gain control of some of the known target hosts which are externally connected via
an internet connection to serve as the point of exfiltration. Once control of the target host is
gained, an exfiltration path is selected based on the preferences for an exfiltration protocol. The
attacker then tries to exfiltrate data packets from the compromised host. The three tasks are
designed to function so that if an attacker discovers a new exfiltration host, the path selection
module determines whether a better path exists and adjusts the exfiltration path accordingly.</p>
        <p>The agent explores the network and gathers information on neighboring hosts by taking the
subnet scan action. In order for the scan to be successful, the agent must first gain access to the
underlying host, which can be achieved by executing an exploit action. Multiple exploits may
exist for a given machine, with each targeting a specific Common Vulnerabilities and Exposures
(CVE) vulnerability.</p>
        <p>Once a foothold is gained on a new host, the agent updates the candidate exfiltration paths that
consist of each of the compromised hosts. It then decides which path is preferred to carry out
the exfiltration based on the predetermined exfiltration protocol strategy. If another target is
captured later, the path selection task evaluates the new paths available and, if a new preferred
path is discovered, the path is updated and the payload is reset to its original value.</p>
        <p>After identifying the preferred exfiltration path, the agent can start sending parts of the
payload to the exit node. The task is completed if the entire payload is uploaded from the initial
node. In order to evade firewall detection, the agent should avoid frequent and large uploads.
To hide its activity, the agent may take a sleep action that simply does nothing for a period of
time.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Network Firewalls</title>
        <p>As in [27], any exfiltration traffic will be monitored by network firewalls, which are placed
between each of the subnets and the public Internet. Upon detection of unusual traffic patterns,
the administrator will be alerted and an emergency firewall update will be conducted. Examples
of suspicious activities include the following:
• the total egress volume exceeds max_upload_volume;
• the total active time surpasses max_upload_time.</p>
        <p>Firewalls are also updated periodically. In particular, a wall-clock is introduced to simulate
the real time of an attack campaign. Different actions will increase the clock time by different
amounts depending on their complexity. Both the regular update and the emergency update
will patch the vulnerabilities and block the outbound traffic from the compromised hosts.</p>
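The two detection triggers above amount to a simple predicate over per-host traffic statistics. A sketch follows; the threshold defaults are illustrative values of ours, not taken from the paper:

```python
def firewall_alert(egress_volume_mb, active_time_s,
                   max_upload_volume=500.0, max_upload_time=60.0):
    # alert if either the total egress volume or the total active upload
    # time from a compromised host exceeds its threshold (defaults assumed)
    return (egress_volume_mb > max_upload_volume
            or active_time_s > max_upload_time)
```

An agent that uploads slowly and sleeps between uploads keeps both statistics below threshold and avoids triggering an emergency update.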
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Protocol-Based Path Selection</title>
        <p>Exfiltration activities within attacker campaigns are typically carried out by exploiting a common
protocol, as these are deemed generally safer and less likely to be detected by security monitoring.
Standard protocols, such as Hypertext Transfer Protocol Secure (HTTPS), are often used to
carry out data exfiltration. Because common protocols are used by enterprise applications, it is
more likely that these protocols are available. It is also more likely that these protocols are not monitored
as closely by security detection methods. As an example, by using the same protocol used by
databases to back up their data to cloud services, attackers emulate the database backup expected
by security rules and do not raise alerts in monitoring systems.</p>
        <p>Path selection is determined by maximizing the utilization of this protocol across as many
hosts in an exfiltration network path as possible. This is not always the shortest path. A
path maximizing the use of the chosen protocol is often more advantageous, even when this
path touches more nodes in the victim’s network. The path selection algorithm accounts
for contingencies when end-to-end use of the designated exfiltration protocol is unavailable.
The algorithm prioritizes finding a complete path using the given protocol over the shortest
path possible. The next criteria considered are the length of the path and the rewards
accumulated. If multiple paths are identified with the same exfiltration protocol coverage, the
shortest path will be prioritized. When no complete path can be created using the exfiltration
protocol, the algorithm searches for the shortest path with the maximum use of the
protocol. The reward function identifies the highest-rewarded path using existing reward
mechanisms, the smallest number of hosts, and maximum use of the exfiltration protocol.</p>
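The ordering described above can be sketched as a lexicographic sort key, under an assumed path representation of ours: a list of (host label, set of services, accumulated reward) tuples per hop. This is an illustration of the criteria, not the paper's algorithm:

```python
def path_preference_key(path, protocol):
    # ordering described above: complete end-to-end protocol coverage first,
    # then higher coverage, then fewer hops, then higher accumulated reward
    hops = len(path)
    covered = sum(1 for _, services, _ in path if protocol in services)
    coverage = covered / hops
    total_reward = sum(reward for _, _, reward in path)
    return (coverage < 1.0, -coverage, hops, -total_reward)

def select_path(candidate_paths, protocol):
    # the preferred exfiltration path minimizes the lexicographic key
    return min(candidate_paths, key=lambda p: path_preference_key(p, protocol))
```

Under this ordering a longer path with full protocol coverage beats a shorter path with partial coverage, matching the behavior reported in the results.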
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Reinforcement Learning Formulation</title>
        <sec id="sec-4-4-1">
          <title>3.4.1. State Space</title>
          <p>The state has the following features for every host:
• Address,
• Operating system,
• Services and processes,
• Discovery value and status,
• Infection value and status,
• Access level information.</p>
          <p>A host’s address is denoted by its subnet ID and local ID. The operating system, service, and
process features have a value of one if they are present at the host and zero otherwise.
Similarly, the discovery and infection statuses are one if the host is discovered or compromised
and zero otherwise. The discovery and infection values represent the reward for successfully
discovering and compromising a host, respectively. Additional features are defined for target
hosts:
• Connection status,
• Time since infection,
• Remaining payload size.
The connection status can be connected, not connected, or isolated (i.e., blocked by firewalls).
The time since infection is measured by the wall-clock rather than time steps. Finally, the
remaining payload size indicates how much is left to upload. The exfiltration task is complete
when the remaining payload size becomes zero.</p>
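One way to picture the per-host features above is as a flat vector with one-hot indicators; the field names and layout here are our assumptions for illustration, not the paper's exact encoding:

```python
def encode_host(host, os_names, service_names):
    # address, one-hot OS and service indicators, discovery/infection
    # status and value, and access level (field names are illustrative)
    vec = [host["subnet_id"], host["local_id"]]
    vec += [1 if host["os"] == name else 0 for name in os_names]
    vec += [1 if s in host["services"] else 0 for s in service_names]
    vec += [host["discovered"], host["discovery_value"],
            host["infected"], host["infection_value"],
            host["access_level"]]
    return vec
```

Target hosts would carry the three additional features (connection status, time since infection, remaining payload size) appended to this vector.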
        </sec>
        <sec id="sec-4-4-2">
          <title>3.4.2. Action Space</title>
          <p>There are four types of actions for the RL agent: subnet scan, exploit, upload, and sleep. Each
action requires specification of a target host, except for the sleep action, which simply does
nothing for a given period of time. Multiple exploits targeting different vulnerabilities may
be available for a given host. Two uploading actions with different speeds are available at each
target: one with a rate of 100 MB/s and another with a rate of 1 MB/s.</p>
          <p>Clock time increases differently based on the action’s result and complexity. Table 2 lists the
assigned clock time for each action. For inapplicable actions, such as performing a subnet
scan without access to the underlying host, the clock time will only move forward by one
second.</p>
        </sec>
        <sec id="sec-4-4-3">
          <title>3.4.3. Reward Function</title>
          <p>The reward function consists of a positive value for achieving sub-goals, such as discovering
or exploiting a host, and a negative value that accounts for the action’s cost. An action with a
higher cost is more likely to trigger the defense terrain. Specifically, we follow the approach
in [1] and assign an action’s cost based on the services running on the target system. The idea is
that even though the adversaries may not know the exact defense mechanism or strength, they
can still infer the presence of defenses based on the host’s service information. In particular, we
categorize the services into three groups: high-risk, medium-risk, and low-risk. The actual
cost of an action then depends on its type (scan, exploit, or upload) and the target’s service
profile.</p>
          <p>Rewards are given based on how much of the chosen exfiltration path is covered by the
exfiltration protocol. For example, if 3 of the 6 hosts in the exfiltration path have the
exfiltration protocol running, then 50 percent of the configured reward is given to the
agent. The agent receives a positive reward for uploading a partial payload from the infected host;
upon finishing sending the entire payload, the agent is given a large bonus reward. However, if
exfiltration is detected by network firewalls, then the agent will receive a penalty equal to the
total accumulated rewards gained on the originating host and the host will be isolated. That
is, the agent will lose all rewards from discovery, infection, and partial uploads. Table 3 lists the
rewards used in this study.</p>
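The coverage-scaled portion of this reward (3 of 6 protocol hosts yielding 50 percent of the configured value) can be sketched directly; the function name is ours:

```python
def protocol_coverage_reward(path_services, protocol, configured_reward):
    # scale the configured reward by the fraction of hosts on the chosen
    # path that run the exfiltration protocol, as described in the text
    covered = sum(1 for services in path_services if protocol in services)
    return configured_reward * covered / len(path_services)
```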
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>In this section we present the experiment details and the results, and discuss key characteristics
of the attack paths learned by the RL agent.</p>
      <sec id="sec-5-1">
        <title>4.1. Network Description</title>
        <p>We have designed two experiment networks. The first experiment network has 10 subnets and
a total of 56 hosts. Each subnet contains between 3 and 12 hosts. The attacker agent is assumed
to have gained an initial foothold on host (8, 2) in subnet 8, which is not directly connected to
the Internet. One particular machine (2, 0) from subnet 2 is designated as the exfiltration host.
Subnet 2 is directly accessible from the Internet, whereas the other subnets are private and not
directly accessible from the Internet. The exfiltration host has a Dynamic Host Configuration
Protocol server (DHCPS) running as a service, which is chosen as the exfiltration protocol.
The second experiment network has 101 subnets and a total of 1444 hosts. This network is
considerably larger than the first. Each subnet contains between 3 and 50 hosts.
The attacker agent is assumed to have gained an initial foothold on host (44, 5) in subnet 44,
which is not directly connected to the Internet. A host connected to the Internet, (5, 10) in
subnet 5, is designated as the exfiltration host. The exfiltration host has an HTTPS service running,
which is chosen as the exfiltration protocol.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Training Details</title>
        <p>The RL agent is trained in an episodic fashion for both networks using the well-known
PPO algorithm. An episode ends when the initial host either completes sending the payload to the
exfiltration host or is isolated by firewalls. The target payload is set to 10,000 MB. Both the
actor and the critic are approximated by a two-layer feed-forward neural network, where the
first layer has 64 neurons and the second layer has 32 neurons. Other key hyperparameters
are listed in Table 4. For the first network the RL agent is trained for 800 episodes, and for the
second network the RL agent is trained for 1000 episodes.</p>
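The actor/critic network shape stated above (two hidden layers of 64 and 32 neurons) can be sketched as follows; the initialization scheme and tanh activation are our assumptions, as the paper does not specify them:

```python
import numpy as np

def build_mlp(input_dim, output_dim, seed=0):
    # two hidden layers of 64 and 32 neurons, as in the text;
    # the random initialization here is purely illustrative
    rng = np.random.default_rng(seed)
    sizes = [input_dim, 64, 32, output_dim]
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # tanh hidden activations (an assumption), linear output layer
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x
```

In an actor-critic setup, one such network with an output per action serves as the actor (followed by a softmax) and another with a single output serves as the critic.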
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>For the first network, episode rewards over training runs are presented in Fig. 1a and episode
lengths in Fig. 1b; for the second network, episode rewards over training runs are presented
in Fig. 2a and episode lengths in Fig. 2b. Training is observed to be stable for both networks,
and the RL policy converges in 800 episodes for the first network and in 1000 episodes for the
second network. Fig. 1a shows that the sum of rewards in an episode for the first network
steadily increases to almost 12,000, and Fig. 2a shows that the sum of rewards in an episode for the
second network steadily increases to a little more than 10,000. During the same intervals, the
episode length gradually decreases for both simulations. This suggests that as training goes on,
the RL agent completes the attack task more efficiently and takes fewer random actions.</p>
      <p>Due to the stochastic nature of the learned policy, the RL agent may take some unnecessary
or redundant actions, such as exploiting unimportant hosts or performing redundant subnet scans. After pruning the
output trajectory, the key steps in the attack for the simulation of the first network can be identified,
as shown in Table 7.</p>
      <p>For the first network, the agent gains a foothold on host (8, 2) in subnet 8, from which it
triggers a subnet scan that leads to the discovery of other hosts in the same subnet and in the
connected subnets, subnet 4 and subnet 6. The agent then exploits host (4, 2) in subnet 4,
which is chosen for further exploitation to build an exfiltration path. A subnet scan is
triggered from host (4, 2), which discovers the hosts present in the connected subnet, subnet 2,
and ultimately the target or exfiltration host (2, 0), which is then compromised to
forge an exfiltration path: (8, 2) → (4, 2) → (2, 0). In search of better paths, the
agent exploits host (6, 0) in subnet 6 and triggers a subnet scan from that host, discovering
hosts on the connected subnet, subnet 5. This scan discovers host (5, 1) in subnet 5, which is later
exploited to forge another exfiltration path: (8, 2) → (6, 0) → (5, 1) → (2, 0).</p>
      <p>The path explored earlier, (8, 2) → (4, 2) → (2, 0), is not a complete exfiltration
protocol-based path, since there is no DHCP (exfiltration protocol) service running on host (4, 2), as
shown in Fig. 3. However, the second path discovered, (8, 2) → (6, 0) → (5, 1) → (2, 0), is a
complete exfiltration protocol-based path, since the same service (i.e., DHCP) is running on both
hosts (6, 0) and (5, 1), as shown in Fig. 3. Notably, the agent chooses the second path over the
first path to upload the payload, as it is a 100 percent protocol-based path and is the optimal path,
even though the first path discovered is shorter in length.</p>
      <p>For the second network, the agent has a foothold on host (44, 5) in subnet 44. Upon
performing various subnet scans and exploits, the agent gains a hold over host (24, 18), and
ultimately discovers and exploits the target or exfiltration host (5, 10). This leads to the
exfiltration path (44, 5) → (24, 18) → (5, 10). Host (24, 18) has an HTTPS service running
on it, hence the path forged is a complete protocol-based path. The capability of the agent to
forge a 100 percent protocol-based path over such a large network indicates that the model is
scalable as well.</p>
      <p>For both networks the agent takes appropriate sleep actions in between the upload actions
so that there is no unusual traffic pattern and cyber defenses are not triggered.</p>
      <p>The agent found paths in both networks that utilize a single network protocol. In real-life
scenarios, attackers try to use a single protocol to avoid increasing attack complexity and to
reduce the risk of inconsistencies or errors, which can lead to a greater possibility of detection.
Choosing to exfiltrate data using existing network protocols that the network defenses (firewalls,
IDS) know about also reduces the risk of discovery by traffic anomaly detection algorithms.
Using standard protocols for exfiltration while considering traffic timing and volume replicates
previously documented Tactics, Techniques, and Procedures (TTPs) [28].</p>
      <p>While novel exfiltration methods that use non-standard protocols, such as Domain Name Service
(DNS), Network Time Protocol (NTP), or Internet Control Message Protocol (ICMP), exist, they
typically require complex setup for execution [8]. They are also usually more closely monitored
by defensive measures for volume and anomalous behaviors than standard protocols due to
their usage in previous exfiltration operations [8]. Data exfiltration requires more network
volume and can be more stealthily sent over less strictly monitored or eccentric channels [29].</p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>The current gap within the cybersecurity industry involves contextualizing and quantitatively
prioritizing the efficacy of deployed security controls to enable sense-making for security
practitioners and network defenders. In this paper, we address this gap by applying RL
to exfiltration path analysis enhanced by integrating protocol and payload considerations. Our
work demonstrates that an RL agent can effectively find an exfiltration path with maximum
exfiltration protocol coverage and can perform exfiltration using this preferred path without
being detected by security infrastructure (i.e., firewalls). Our results identify optimal paths
that provide insights for operators, analysts, and defenders to evaluate the value of currently
deployed security controls which influence (i.e., isolate or eliminate) the connections within
the path. As a result, the operations community can utilize this data to formulate task lists for
securing enterprise networks.</p>
      <p>This RL approach identified the most likely hosts and services used when exfiltrating data
while capturing variable metrics used in network risk assessments. The strength of this approach
was validated through identification of intentional network misconfigurations that mimic
real-world vulnerabilities. In future work we consider expanding the risk formalism to increase its
sophistication and maturity, which will drive increased applicability and relevance.</p>
      <p>[22] Z. Hu, R. Beuran, Y. Tan, Automated penetration testing using deep reinforcement learning,
in: 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&amp;PW),
IEEE, 2020, pp. 2–10.
[23] R. Gangupantulu, T. Cody, P. Park, A. Rahman, L. Eisenbeiser, D. Radke, R. Clark,
Using cyber terrain in reinforcement learning for penetration testing, arXiv preprint
arXiv:2108.07124 (2021).
[24] R. Gangupantulu, T. Cody, A. Rahman, C. Redino, R. Clark, P. Park, Crown jewels analysis
using reinforcement learning with attack graphs, arXiv preprint arXiv:2108.09358 (2021).
[25] L. Huang, T. Cody, C. Redino, A. Rahman, A. Kakkar, D. Kushwaha, C. Wang, R. Clark,
D. Radke, P. Beling, et al., Exposing surveillance detection routes via reinforcement
learning, attack graphs, and cyber terrain, arXiv preprint arXiv:2211.03027 (2022).
[26] H. V. Nguyen, S. Teerakanok, A. Inomata, T. Uehara, The proposal of double agent
architecture using actor-critic algorithm for penetration testing, in: ICISSP, 2021, pp.
440–449.
[27] C. Wang, A. Kakkar, C. Redino, A. Rahman, S. Ajinsyam, R. Clark, D. Radke, T. Cody,
L. Huang, E. Bowen, Discovering command and control channels using reinforcement
learning, in: SoutheastCon 2023, IEEE, 2023, pp. 685–692.
[28] MITRE ATT&amp;CK® framework, 2021. URL: https://attack.mitre.org.
[29] B. Sabir, F. Ullah, M. A. Babar, R. Gaire, Machine learning for detecting data exfiltration: A
review 54 (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Redino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kakkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kushwaha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Beling</surname>
          </string-name>
          , E. Bowen,
          <article-title>Discovering exfiltration paths using reinforcement learning with attack graphs</article-title>
          ,
          <source>in: 2022 IEEE Conference on Dependable and Secure Computing (DSC)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>N. I. of Standards</surname>
          </string-name>
          , Technology,
          <source>Security and Privacy Controls for Federal Information Systems and Organizations</source>
          ,
          <source>Technical Report NIST Special Publication 800-53</source>
          Revision 5, U.S. Department of Commerce, Washington, D.C.,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Conti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Raymond</surname>
          </string-name>
          ,
          <article-title>On cyber: towards an operational art for cyber conflict</article-title>
          , Kopidion Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Maeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mimura</surname>
          </string-name>
          ,
          <article-title>Automating post-exploitation with deep reinforcement learning</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>100</volume>
          (
          <year>2021</year>
          )
          <fpage>102108</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Gharakheili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sivaraman</surname>
          </string-name>
          ,
          <article-title>Real-time detection of dns exfiltration and tunneling from enterprise networks, in: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM)</article-title>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>649</fpage>
          -
          <lpage>653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aminov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shabtai</surname>
          </string-name>
          ,
          <article-title>Detection of malicious and low throughput data exfiltration over the dns protocol</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>80</volume>
          (
          <year>2019</year>
          )
          <fpage>36</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Detecting dns over https based data exfiltration</article-title>
          ,
          <source>Computer Networks</source>
          <volume>209</volume>
          (
          <year>2022</year>
          )
          <fpage>108919</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A dns tunneling detection method based on deep learning models to prevent data exfiltration, in: Network and System Security: 13th International Conference</article-title>
          , NSS 2019, Sapporo, Japan,
          <source>December 15-18</source>
          ,
          <year>2019</year>
          , Proceedings 13, Springer,
          <year>2019</year>
          , pp.
          <fpage>520</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>A. Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-Y. Shen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Shashanka</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Detection of exfiltration and tunneling over dns</article-title>
          ,
          <source>in: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>737</fpage>
          -
          <lpage>742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>C. J. C. H. Watkins</surname>
          </string-name>
          ,
          <article-title>Learning from delayed rewards (</article-title>
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ostrovski</surname>
          </string-name>
          , et al.,
          <article-title>Human-level control through deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>518</volume>
          (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for cyber security</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>05799</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Klimov</surname>
          </string-name>
          ,
          <article-title>Proximal policy optimization algorithms</article-title>
          ,
          <source>arXiv preprint arXiv:1707.06347</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>High-dimensional continuous control using generalized advantage estimation</article-title>
          ,
          <source>arXiv preprint arXiv:1506.02438</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>M. C. Ghanem</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for intelligent penetration testing</article-title>
          ,
          <source>in: 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4)</source>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>185</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kurniawati</surname>
          </string-name>
          ,
          <article-title>Autonomous penetration testing using reinforcement learning</article-title>
          , arXiv preprint arXiv:
          <year>1905</year>
          .
          <volume>05965</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>M. C. Ghanem</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning for eficient network penetration testing</article-title>
          ,
          <source>Information</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <article-title>6</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. O'Brien</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Automated post-breach penetration testing through reinforcement learning</article-title>
          ,
          <source>in: 2020 IEEE Conference on Communications and Network Security (CNS)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yousefi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mtetwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tianfield</surname>
          </string-name>
          ,
          <article-title>A reinforcement learning approach for attack graph analysis</article-title>
          ,
          <source>in: 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science</source>
          And Engineering (TrustCom/BigDataSE), IEEE,
          <year>2018</year>
          , pp.
          <fpage>212</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Mahendran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Romo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabur</surname>
          </string-name>
          ,
          <article-title>Autonomous security analysis and penetration testing</article-title>
          ,
          <source>in: 2020 16th International Conference on Mobility, Sensing and Networking (MSN)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>508</fpage>
          -
          <lpage>515</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>