<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Agent Behaviors in Network Security through Trajectory Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ondrej Lukas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Garcia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Electrical Engineering, Czech Technical University in Prague</institution>
          ,
          <addr-line>Czechia</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Reinforcement learning has been successfully used for training security agents, but the behavior of the resulting policies has not been explained. In this work, we study the behavior of reinforcement learning-based attacking agents in network security environments to understand how to improve them. The sequences of steps (trajectories) generated by the agent-environment interactions are used for (i) analyzing the change in behavior during the training process and (ii) analyzing the performance of the policy of the trained agent. Our proposed method uses a vector representation of the trajectory steps, which are clustered to find similarities in the trajectories based on the actions taken, their effects on the state of the environment, and the rewards obtained by the agents. The trajectory cluster analysis is paired with additional visualizations to provide a better and deeper understanding of the policies. Preliminary results show that the proposed method can identify behavioral patterns in the agents' policies and subsequently help guide the agent's learning process.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable RL</kwd>
        <kwd>Trajectory Analysis</kwd>
        <kwd>Policy Evaluation</kwd>
        <kwd>Network Security</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Reinforcement Learning (RL) has been successfully used in various complex problems, from
theoretical games to robotics. It has already been applied to the security domain in research
using simulated security environments [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Various model architectures have been proposed
for RL-based agents playing the role of the attacker or penetration tester, using both traditional
and deep RL methods [
        <xref ref-type="bibr" rid="ref3">3, 4, 5</xref>
        ].
      </p>
      <p>The evaluation of an agent and its learning progress often relies only on numerical metrics
such as win rate or mean return. While informative, especially during the early stages of
agent training, such evaluation does not provide sufficient insight into the trained agent’s
behavior, its development throughout the training process, or its ability to generalize [6]. Therefore,
a deeper insight into the trajectories generated by the agent’s policy plays an important role in
hyperparameter selection, training setup, and agent verification [7]. In this ongoing work,
we focus on exploring two main research questions:
1. RQ1: How can trajectory analysis provide insights for model-agnostic behavior
evaluation and understanding?
2. RQ2: How do the trajectories change during the training process?
Furthermore, we aim to explore the suitability of trajectory analysis for finding similarities
in the behavior of different model architectures. These similarities can help identify the steps
necessary to solve a task and can further be used to validate the agents’ behavior and explain it to humans.</p>
      <p>The main contribution of this work is an evaluation of a variety of RL agents playing in the
NetSecGame environment and a comparison of their behaviors. The comparison and evaluation
follow a method for processing the game-play trajectories of the agents’ policies.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In recent years, there have been notable advancements in RL for security, both in the
agents [
        <xref ref-type="bibr" rid="ref3">8, 9, 3, 4, 5</xref>
        ] and the environments [
        <xref ref-type="bibr" rid="ref1 ref2">2, 1, 8</xref>
        ]. Still, there is a lack of explainable methods
that would allow verification and easier human-computer cooperation in the security domain.
      </p>
      <p>The increased focus on model interpretability can also be seen in the reinforcement learning
domain. There are three main approaches to explainable RL: model explanation focuses
on the underlying model, policy explanation explains the behavior of the agent, and outcome
explanation provides a local explanation of a (sub)trajectory [10].</p>
      <p>In the latter two, the trajectories are commonly used for decision attribution [11], visual
explanations [12], direct model improvement [13], or summarization of the agent’s behavior [14,
15]. However, none of these methods has been evaluated in a security scenario.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The NetSecGame is a simulation environment for high-level network security tasks. The agent,
playing the role of the attacker, interacts with the environment in an episodic setup, learning
the dynamics of the environment in the process. The scenario used in this work simulates
a sensitive data exfiltration attack, in which the attacker’s task is to (i) understand the
topology of the local networks, (ii) locate the sensitive data, (iii) take the measures necessary to access
the data, and finally (iv) exfiltrate the data to an external location outside the local network.</p>
      <p>In the NetSecGame, state representation and actions do not have a fixed size, which is different
from most gym-like environments. States are represented by a collection of assets available to
the agent consisting of a set of known networks, a set of known hosts, a set of controlled hosts, a
set of known services, and a set of known data.</p>
      <p>The actions in this environment consist of five action types, each with a different
set of parameters that are selected from the state representation (e.g., IP addresses and services).
All actions have a parameter identifying from which host (position in the environment) the action is
executed. For example, given a state s in which the agent controls a single host A in a network
N, the action ScanNetwork can be played with parameters source host=A, target network=N.
The NetSecGame environment is available at https://github.com/stratosphereips/NetSecGame.</p>
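      <p>As an illustration of this parametrization, the following Python sketch models an action as a type plus parameters drawn from the state; the class and field names (and the example addresses) are our own assumptions for illustration, not the NetSecGame API.</p>

```python
from dataclasses import dataclass

# Hypothetical sketch of a NetSecGame-style parametrized action.
# Action type names follow the paper; the field names are assumptions.
@dataclass(frozen=True)
class Action:
    action_type: str   # e.g. "ScanNetwork", "FindServices", "ExfiltrateData"
    source_host: str   # host from which the action is executed
    parameters: dict   # remaining parameters drawn from the current state

# Example from the text: the agent controls host A in network N
# and plays ScanNetwork from A against N (addresses are illustrative).
scan = Action(
    action_type="ScanNetwork",
    source_host="192.168.1.2",                          # host A
    parameters={"target_network": "192.168.1.0/24"},    # network N
)
```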
      <p>Such parametrization of actions allows for a modular and flexible environment that can
model various scenarios and situations. However, it also makes the trained policies difficult to visualize,
analyze, and evaluate. The environment changes after every move of the agent, and a new state
and an immediate reward are given to the agent. Each of these steps is represented by a tuple
(s_t, a_t, r_{t+1}, s_{t+1}), where s_t is the current state of the game, a_t is the action performed in the
state s_t, r_{t+1} is the immediate reward for playing action a_t, and s_{t+1} is the following state resulting
from action a_t in state s_t.</p>
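      <p>The step tuple above can be sketched as a small data structure; the class and field names are illustrative assumptions, not the NetSecGame API.</p>

```python
from dataclasses import dataclass, field
from typing import Any, List

# One step of the interaction: (s_t, a_t, r_{t+1}, s_{t+1}).
@dataclass
class Step:
    state: Any        # s_t: sets of known networks/hosts/services/data
    action: Any       # a_t: parametrized action played in s_t
    reward: float     # r_{t+1}: immediate reward for playing a_t
    next_state: Any   # s_{t+1}: state resulting from a_t in s_t

# A trajectory is the sequence of steps of one episode plus its end reason.
@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    end: str = "timeout"   # "win", "detected", or "timeout"

traj = Trajectory()
traj.steps.append(Step(state={"hosts": 1}, action="ScanNetwork",
                       reward=-1.0, next_state={"hosts": 3}))
```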
      <p>A trajectory τ is a sequence of steps starting from the initial state s_0 and ending in the terminal state
of the episode, which is reached when the goal is achieved, the agent is detected, or a timeout occurs (reaching
the maximum allowed episode length). To analyze the agent’s behavior during and after
training, we capture the trajectories generated by each agent.</p>
      <p>We evaluate three types of agents (models) in the current stage of this work: two variants
of Q-learning and an LLM-based agent. The first model is a vanilla Q-learning algorithm with
decaying ε-greedy exploration. The second model also uses Q-learning but is extended with
concepts that generalize across networks without relying on details such as IP addresses, helping
merge some of the state-action pairs. This results in better generalization to unknown networks
and less overfitting to the topology of the network used in the training task. The last model
evaluated is based on the OpenAI LLM GPT-3.5-turbo. The LLM agent [8] uses the textual
representation of the state, a description of the goal, and the environment to select the action to be
played in each state. The LLM is not fine-tuned for playing the role of an attacker; it is adapted
only through prompt composition.</p>
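      <p>The decaying ε-greedy exploration of the vanilla Q-learning agent can be sketched as follows; the decay schedule and constants are illustrative, not the paper’s settings.</p>

```python
import random

# Explore with probability epsilon, otherwise act greedily on the Q-values.
def select_action(q_values, actions, epsilon):
    if random.random() >= epsilon:
        # exploit: pick the action with the highest estimated Q-value
        return max(actions, key=lambda a: q_values.get(a, 0.0))
    # explore: pick a uniformly random action
    return random.choice(actions)

# Multiplicative decay of epsilon after each episode, with a floor.
epsilon, eps_min, decay = 1.0, 0.05, 0.99
for episode in range(200):
    # ... play one episode, selecting actions with select_action(...) ...
    epsilon = max(eps_min, epsilon * decay)
```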
      <p>Trajectories from 500 evaluation episodes were collected for each model at multiple training
checkpoints for comparison and analysis. In the case of the LLM agent, there was no training
period; thus, only the evaluation trajectories were used. The first part of the policy evaluation
focuses only on the sequence of actions. We study the action type distribution per step based on
all the trajectories gathered for a policy. The distribution of the action types shows the agent’s
primary goal in each step of the interaction.</p>
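      <p>The per-step action-type distribution (the statistic behind Figure 1) can be computed with a few lines of Python; the trajectory format used here, a plain list of action-type strings, is an assumption for illustration.</p>

```python
from collections import Counter

# For each step index i, count which action types occur across all
# trajectories, then normalize the counts into fractions.
def action_distribution_per_step(trajectories):
    per_step = {}
    for traj in trajectories:
        for i, action_type in enumerate(traj):
            per_step.setdefault(i, Counter())[action_type] += 1
    return {
        i: {a: c / sum(cnt.values()) for a, c in cnt.items()}
        for i, cnt in per_step.items()
    }

dist = action_distribution_per_step([
    ["ScanNetwork", "FindServices", "ExploitService"],
    ["ScanNetwork", "FindData"],
])
# In this toy data, every trajectory opens with ScanNetwork.
```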
      <p>The optimal trajectory in the data exfiltration scenario consists of 5 steps, which allows
computing the mean action type efficiency over a set of trajectories T as follows:</p>
      <p>Let T_w = {τ ∈ T | τ.end = win} be the subset of trajectories in which the agent wins. Then,
we compute the efficiency of the action type a as</p>
      <p>eff(a) = |T_w| / |{s ∈ T_w | s.action = a}|</p>
      <p>This metric equals 1 for each action type in an optimal trajectory, as each action type should be played
exactly once. Values less than 1 mean that actions of type a are played repeatedly, most likely
with incorrect parameters.</p>
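      <p>A minimal sketch of this efficiency metric, assuming trajectories are dicts with an end reason and a list of played action types (this format is our assumption, not the NetSecGame output):</p>

```python
# eff(a) = |T_w| / number of steps in winning trajectories using action type a.
def action_efficiency(trajectories, action_type):
    wins = [t for t in trajectories if t["end"] == "win"]
    uses = sum(1 for t in wins for a in t["actions"] if a == action_type)
    return len(wins) / uses if uses else float("nan")

trajs = [
    {"end": "win", "actions": ["ScanNetwork", "ScanNetwork", "FindData"]},
    {"end": "timeout", "actions": ["ScanNetwork"]},
]
# ScanNetwork is played twice in the single winning trajectory: eff = 1/2.
```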
      <p>While analyzing the action sequence brings insights into the agent’s behavior, it does not
fully use the information the trajectories provide. Therefore, we propose encoding each step s_t
of a trajectory τ using the following vector representation for further processing and analysis:
1. Size of each component of s_t.
2. Size of each component of s_{t+1}.
3. Amount of change caused by a_t (|s_{t+1} − s_t|).
4. Reward r_{t+1}.
5. Return when starting from step t (the sum of all rewards the agent expects to receive
from state s_t until the end of the episode).
6. One-hot encoded action a_t used in step t.</p>
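      <p>The six-part encoding above can be sketched as follows; the component names and the dict-of-sets state format are assumptions for illustration, not the NetSecGame representation.</p>

```python
COMPONENTS = ["networks", "hosts", "controlled", "services", "data"]
ACTIONS = ["ScanNetwork", "FindServices", "ExploitService",
           "FindData", "ExfiltrateData"]

def encode_step(state, next_state, action_type, reward, return_to_go):
    vec = [len(state[c]) for c in COMPONENTS]        # 1. component sizes of s_t
    vec += [len(next_state[c]) for c in COMPONENTS]  # 2. component sizes of s_{t+1}
    # 3. amount of change: elements that differ between s_t and s_{t+1}
    vec += [sum(len(next_state[c] ^ state[c]) for c in COMPONENTS)]
    vec += [reward, return_to_go]                    # 4.-5. reward and return
    vec += [1.0 if a == action_type else 0.0 for a in ACTIONS]  # 6. one-hot a_t
    return vec

s0 = {"networks": {"net1"}, "hosts": {"hostA"}, "controlled": {"hostA"},
      "services": set(), "data": set()}
s1 = dict(s0, hosts={"hostA", "hostB", "hostC"})   # two new hosts found
vec = encode_step(s0, s1, "ScanNetwork", reward=-1.0, return_to_go=4.0)
```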
      <p>After the encoding, the trajectory steps are processed by UMAP (Uniform Manifold
Approximation and Projection) [16]. UMAP is a dimensionality reduction technique that efficiently maps
high-dimensional data into a lower-dimensional space. It uses manifold learning techniques
to model the underlying structure of the data, preserving both local and global structure. We
propose using the projection to find similarities among the trajectory steps of different models.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The results of comparing action types can be seen in Figure 1. It shows the distribution of
actions in each step of the trajectory for Q-learning (Figure 1a), Q-learning with general
concepts (Figure 1b), and the GPT-3.5 agent (Figure 1c). The bar plot in Figure 1d compares the action
efficiency of each model.</p>
      <p>The UMAP projection of the trajectory steps is shown in Figures 2 and 3. The step number,
underlying model, action type, and outcome of the trajectories are highlighted.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The comparison in Figure 1 shows several differences in the policies, most notably in the lengths
of the trajectories and the action composition.</p>
      <p>All three models show an initial exploration phase in the first steps (mainly composed of the
Scan Network and Scan Services actions). In the case of the Conceptual Q-learning agent, however,
the first steps are heavily focused on the FindData action. Since searching for data in a host requires control
of that host, taking this action in the first step of the game is impractical, as confirmed by the
action efficiency of 15% shown in Figure 1d. The second major difference in the behavior of
the Conceptual agent is its significant use of the Data Exfiltration action in the later stages of the
interaction. In comparison, the LLM and Q-learning agents exfiltrate data in very few
cases, suggesting that exfiltration only happens for the correct data point required to win the game.</p>
      <p>The high action efficiency of the Q-learning agent may indicate that the model is
overfitted to the particular task and network topology. This hypothesis is further supported by
the low number of Find Data actions, likely caused by a lack of exploration. In contrast, the LLM
agent (which has no additional training for this particular task) shows more exploration (usage
of Scan Network, Find Services, and Find Data). Such behavior, while less efficient in
this particular task and topology, can lead to better generalization capabilities of the policy.</p>
      <p>The UMAP projection in Figure 2 supports the hypothesis of unnecessary use of the Exfiltrate
Data action by the conceptual agent, as those steps should be taken later in the interaction and
lead to either a timeout or a detection ending, which is visible in the largest central
cluster.</p>
      <p>Model attribution in the second subplot of Figure 2 indicates a higher similarity between the
Q-learning and LLM trajectory steps despite significant model differences.</p>
      <p>Figure 3 shows the comparison of the trajectories of the Q-learning model at five distinct
points of training. Since the figure depicts only one model type, it shows a lower separation of
the clusters. However, the action type subplot shows smaller surrounding clusters with
higher purity that correspond to the winning trajectories. These clusters are attributed to the
policies in the later stages of training, with steps that occurred in the first twenty steps
of the trajectories. A possible explanation is that as the model adapts to the environment, the
produced trajectories show less exploration and higher similarity. In the projection, this results
in the smaller peripheral clusters.</p>
      <p>A notable exception is the two clusters of steps with actions ScanNetwork and FindServices in
the lower left part of the plot. The end reason subplot shows that they consist of both winning
and losing trajectory steps. We can see that those steps occur at the beginning of the trajectories
and for most of the models. The most likely reason is that these clusters consist of the agents’
initial reconnaissance steps. Since the starting state, while randomized, is very similar in each
trajectory, this part of the Q-table is learned very early in the training process.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>In this early-stage work, we introduce policy evaluation for a security scenario using
trajectory step analysis. We propose a vector representation of the trajectories generated by
the RL agent and demonstrate its application in visual explanations of trained policies. We show
that the trajectory steps and the proposed vector representation can be used to find similarities in
the policies of different model types. We evaluate the method’s ability to explain the policies during the
training process.</p>
      <p>Currently, no DRL models are included in the evaluation. Additionally, a comparison with
other security environments is needed.</p>
      <p>In the project’s current phase, the analysis focuses only on the individual steps of the trajectories, but
such an approach might not capture all the complexities of the policy. Future work should
extend the analysis to sequences of steps and potentially whole trajectories. The
primary motivation for such an extension is to better understand and interpret the changes
in the agent’s behavior during training. Secondly, better clustering of trajectory steps can
allow the detection of agents’ intrinsic sub-goals in the trajectories, their comparison across
model types, and their mapping to the existing knowledge base of attacking techniques.
</p>
      <p>[4] K. Tran, A. Akella, M. Standen, J. Kim, D. Bowman, T. Richer, C.-T. Lin, Deep hierarchical reinforcement agents for automated penetration testing, 2021. arXiv:2109.06449.
[5] Z. Hu, R. Beuran, Y. Tan, Automated penetration testing using deep reinforcement learning, in: 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&amp;PW), 2020, pp. 2-10. doi:10.1109/EuroSPW51379.2020.00010.
[6] K. Cobbe, O. Klimov, C. Hesse, T. Kim, J. Schulman, Quantifying generalization in reinforcement learning, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 1282-1289. URL: https://proceedings.mlr.press/v97/cobbe19a.html.
[7] S. Milani, N. Topin, M. Veloso, F. Fang, Explainable reinforcement learning: A survey and comparative review, ACM Comput. Surv. 56 (2024). URL: https://doi.org/10.1145/3616864. doi:10.1145/3616864.
[8] M. Rigaki, O. Lukáš, C. Catania, S. Garcia, Out of the cage: How stochastic parrots win in cyber security environments, in: Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART, INSTICC, SciTePress, 2024, pp. 774-781. doi:10.5220/0012391800003636.
[9] T. T. Nguyen, V. J. Reddi, Deep reinforcement learning for cyber security, IEEE Transactions on Neural Networks and Learning Systems 34 (2023) 3779-3795. doi:10.1109/TNNLS.2021.3121870.
[10] G. A. Vouros, Explainable deep reinforcement learning: State of the art and challenges, ACM Comput. Surv. 55 (2022). URL: https://doi.org/10.1145/3527448. doi:10.1145/3527448.
[11] S. V. Deshmukh, A. Dasgupta, B. Krishnamurthy, N. Jiang, C. Agarwal, G. Theocharous, J. Subramanian, Explaining RL decisions with trajectories, 2024. URL: http://arxiv.org/abs/2305.04073. doi:10.48550/arXiv.2305.04073. arXiv:2305.04073 [cs].
[12] Y. Takagi, R. Tabalba, N. Kirshenbaum, J. Leigh, Abstracted trajectory visualization for explainability in reinforcement learning, 2024. arXiv:2402.07928.
[13] J. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, S. Levine, Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1009-1018. URL: https://proceedings.mlr.press/v80/co-reyes18a.html.
[14] D. Amir, O. Amir, Highlights: Summarizing agent behavior to people, in: Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, 2018, pp. 1168-1176.
[15] N. Topin, M. Veloso, Generation of policy-level explanations for reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 2514-2521.
[16] L. McInnes, J. Healy, J. Melville, UMAP: Uniform Manifold Approximation and Projection for dimension reduction, ArXiv e-prints (2018). arXiv:1802.03426.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Microsoft Defender Research Team, CyberBattleSim, https://github.com/microsoft/cyberbattlesim, <year>2021</year>. Created by Christian Seifert, Michael Betser, William Blum, James Bono, Kate Farris, Emily Goren, Justin Grana, Kristian Holsheimer, Brandon Marken, Joshua Neil, Nicole Nichols, Jugal Parikh, Haoran Wei.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Janisch, T. Pevný, V. Lisý, <article-title>NASimEmu: Network attack simulator &amp; emulator for training agents generalizing to novel scenarios</article-title>, <source>in: European Symposium on Research in Computer Security</source>, Springer, <year>2023</year>, pp. <fpage>589</fpage>-<lpage>608</lpage>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Chaudhary, A. O'Brien, S. Xu, <article-title>Automated post-breach penetration testing through reinforcement learning</article-title>, <source>in: 2020 IEEE Conference on Communications and Network Security (CNS)</source>, <year>2020</year>, pp. <fpage>1</fpage>-<lpage>2</lpage>. doi:10.1109/CNS48642.2020.9162301.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>