Exploring Agent Behaviors in Network Security through Trajectory Clustering

Ondrej Lukas1,*,†, Sebastian Garcia1,†

1 Faculty of Electrical Engineering, Czech Technical University in Prague, Czechia

Abstract
Reinforcement learning has been successfully used for training security agents, but the behavior of the resulting policies remains largely unexplained. In this work, we study the behavior of reinforcement learning-based attacking agents in network security environments to understand how to improve them. The sequences of steps (trajectories) generated by the agent-environment interactions are used for (i) analyzing the change in behavior during the training process and (ii) analyzing the performance of the trained agent's policy. Our proposed method uses a vector representation of the trajectory steps, which are clustered to find similarities in the trajectories based on the actions taken, their effects on the state of the environment, and the rewards obtained by the agents. The trajectory cluster analysis is paired with additional visualizations to provide a deeper understanding of the policies. Preliminary results show that the proposed method can identify behavioral patterns in the agents' policies and subsequently help guide the agent's learning process.

Keywords
Explainable RL, Trajectory Analysis, Policy Evaluation, Network Security

1. Introduction

Reinforcement Learning (RL) has been successfully used in various complex problems, from theoretical games to robotics. Its application to the security domain has already been adopted by research in simulated security environments [1, 2]. Various model architectures have been proposed for RL-based agents playing the role of the attacker or penetration tester, covering both traditional and deep RL methods [3, 4, 5]. The evaluation of an agent and its learning progress often relies only on numerical metrics such as win rate or mean return.
While informative, especially during the early stages of agent training, such evaluation does not provide sufficient insight into the trained agent's behavior, its development throughout the training process, or its ability to generalize [6]. Therefore, a deeper insight into the trajectories generated by the agent's policy plays an important role in hyperparameter selection, training setup, and agent verification [7]. In this ongoing work, we focus on exploring two main research questions:

1. RQ1: How can trajectory analysis provide insights for model-agnostic behavior evaluation and understanding?
2. RQ2: How do the trajectories change during the training process?

Furthermore, we aim to explore the suitability of trajectory analysis for finding similarities in the behavior of different model architectures. These can help identify the steps necessary for solving the task and can further be used to validate the agents' behavior and explain it to humans. The main contribution of this work is an evaluation of a variety of RL agents playing in the NetSecGame environment and a comparison of their behaviors. The comparison and evaluation follow a method for processing the game-play trajectories of the agents' policies.

Late-breaking work, Demos and Doctoral Consortium, colocated with The 2nd World Conference on eXplainable Artificial Intelligence: July 17–19, 2024, Valletta, Malta
* Corresponding author.
† These authors contributed equally.
Email: ondrej.lukas@aic.fel.cvut.cz (O. Lukas); sebastian.garcia@agents.fel.cvut.cz (S. Garcia)
Web: https://cs.fel.cvut.cz/en/people/lukasond (O. Lukas); https://cs.fel.cvut.cz/en/people/garciseb (S. Garcia)
ORCID: 0000-0002-7922-8301 (O. Lukas); 0000-0001-6238-9910 (S. Garcia)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

2.
Related Work

In recent years, there have been notable advancements in RL for security, both in the agents [8, 9, 3, 4, 5] and the environments [2, 1, 8]. Still, there is a lack of explainability methods that would allow verification and easier human-computer cooperation in the security domain. The increased focus on model interpretability can also be seen in the reinforcement learning domain. There are three main approaches to explainable RL: Model explanation focuses only on the underlying model, Policy explanation explains the behavior of the agent, and Outcome explanation focuses on the local explanation of a (sub)trajectory [10]. In the latter two, trajectories are commonly used for decision attribution [11], visual explanations [12], direct model improvement [13], or summarization of the agent's behavior [14, 15]. However, none of these methods has been evaluated in a security scenario.

3. Methodology

NetSecGame1 is a simulated environment of high-level network security tasks. The agent, playing the role of the attacker, interacts with the environment in an episodic setup, learning the dynamics of the environment in the process. The scenario used in this work simulates a sensitive data exfiltration attack, in which the task of the attacker is to (i) understand the topology of the local networks, (ii) locate the sensitive data, (iii) take certain measures to access the data, and finally (iv) exfiltrate the data to an external location outside the local network. In the NetSecGame, the state representation and actions do not have a fixed size, which differs from most gym-like environments. States are represented by a collection of assets available to the agent, consisting of a set of known networks, a set of known hosts, a set of controlled hosts, a set of known services, and a set of known data.
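As an illustration, such a variable-sized state can be sketched as a collection of sets (a hypothetical Python sketch; the class and field names are assumptions and do not mirror the actual NetSecGame API):

```python
from dataclasses import dataclass

# Hypothetical sketch of the variable-sized state described above;
# names are illustrative, not the actual NetSecGame implementation.
@dataclass(frozen=True)
class GameState:
    known_networks: frozenset = frozenset()
    known_hosts: frozenset = frozenset()
    controlled_hosts: frozenset = frozenset()
    known_services: frozenset = frozenset()
    known_data: frozenset = frozenset()

    def component_sizes(self):
        # Component sizes vary between states, unlike fixed-size
        # observations in typical gym-like environments.
        return (len(self.known_networks), len(self.known_hosts),
                len(self.controlled_hosts), len(self.known_services),
                len(self.known_data))

state = GameState(known_networks=frozenset({"192.168.1.0/24"}),
                  controlled_hosts=frozenset({"192.168.1.10"}))
print(state.component_sizes())  # (1, 0, 1, 0, 0)
```

Representing each component as a set keeps the state unordered and growable as the agent discovers new assets, which is exactly what makes a fixed-size observation vector impractical here.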
The actions in this environment consist of five action types, each with a different set of parameters selected from the state representation (e.g., IP addresses and services). All actions have a parameter that identifies from which host (position in the environment) the action is executed. For example, given a state s in which the agent controls a single host A in a network N, the action ScanNetwork can be played with the parameters source host=A, target network=N. Such parametrization of actions allows for a modular and flexible environment that can model various scenarios and situations. However, it also makes the trained policies difficult to visualize, analyze, and evaluate. The environment changes after every agent's move, and a new state and an immediate reward are given to the agent. Each of these steps is represented by a tuple (s_t, a_t, r_{t+1}, s_{t+1}), where s_t is the current state of the game, a_t is the action performed in state s_t, r_{t+1} is the immediate reward for playing action a_t, and s_{t+1} is the following state resulting from action a_t in state s_t. A trajectory t is a sequence of steps starting from the initial state s_0 until the terminal state of the episode, which ends when the goal is reached, the agent is detected, or a timeout occurs (the maximum allowed episode length is reached).

To analyze the agent's behavior during and after training, we capture the trajectories generated by each agent. We evaluate three types of agents (models) in the current stage of this work: two variants of Q-learning and an LLM-based agent. The first model is a vanilla Q-learning algorithm with decaying ε-greedy exploration. In contrast, the second model still uses Q-learning but is extended with concepts that generalize over networks without knowing details such as IP addresses, helping merge some of the state-action pairs.

1 https://github.com/stratosphereips/NetSecGame
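The step tuples (s_t, a_t, r_{t+1}, s_{t+1}) and episode trajectories described above can be captured with a simple rollout loop (a minimal sketch; the `env` and `policy` interfaces are assumed for illustration and do not mirror the actual NetSecGame API):

```python
from collections import namedtuple

# One step is the tuple (s_t, a_t, r_{t+1}, s_{t+1}) described above.
Step = namedtuple("Step", ["state", "action", "reward", "next_state"])

def collect_trajectory(env, policy, max_steps=100):
    """Roll out one episode and return its trajectory as a list of steps.

    The episode ends when the environment signals a terminal state
    (goal reached or agent detected) or when max_steps is exceeded,
    which corresponds to the timeout ending.
    """
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append(Step(state, action, reward, next_state))
        if done:
            break
        state = next_state
    return trajectory
```

Storing whole trajectories rather than only episode-level metrics is what enables the per-step analysis and clustering in the following sections.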
This results in better generalization to unknown networks and less overfitting to the topology of the network used in the training task. The last model evaluated is based on the OpenAI LLM GPT-3.5-turbo. The LLM agent [8] uses the textual representation of the state, a description of the goal, and the environment to select an action to be played in the state. The LLM is not fine-tuned for playing the role of an attacker; it is guided only by the prompt composition. Trajectories from 500 evaluation episodes were collected for each model at multiple training checkpoints for comparison and analysis. In the case of the LLM agent, there was no training period; thus, only the evaluation trajectories were used.

The first part of the policy evaluation focuses only on the sequence of actions. We study the action type distribution per step based on all the trajectories gathered for a policy. The distribution of the action types shows the agent's primary goal in each step of the interaction. The optimal trajectory in the data exfiltration scenario consists of 5 steps, which allows computing the mean action type efficiency given a set of trajectories T as follows: let T_wins = {t ∈ T | t.end = win} be the subset of trajectories in which the agent wins. Then, we compute the efficiency of the action type a_t as

efficiency_{a_t}(T_wins) = |T_wins| / |{s ∈ T_wins | s.action = a_t}|

This metric equals 1 for each action type in an optimal trajectory, as each action type should be played exactly once per winning trajectory. Values less than 1 mean that actions of type a_t are played repeatedly, most likely with incorrect parameters.

While analyzing the action sequence brings insights into the agent's behavior, it does not fully use the information the trajectories provide. Therefore, we propose encoding each step s of a trajectory t using the following vector representation for further processing and analysis:

1. Size of each component of s.
2. Size of each component of s_next.
3. Amount of change caused by a (|s_next − s|).
4.
Reward r.
5. Return from step s (the sum of all rewards the agent receives from state s until the end of the episode).
6. One-hot encoded action a used in step s.

After the encoding, the trajectory steps are processed with UMAP (Uniform Manifold Approximation and Projection) [16]. UMAP is a dimensionality reduction technique that efficiently maps high-dimensional data into a lower-dimensional space. It uses manifold learning techniques to model the underlying structure of the data, preserving both local and global structure. We propose using the projection to find similarities among the trajectory steps of different models.

4. Results

The results of comparing action types can be seen in Figure 1. It shows the distribution of actions in each step of the trajectory for Q-learning (Figure 1a), Q-learning with general concepts (Figure 1b), and the GPT-3.5 agent (Figure 1c). The bar plot in Figure 1d compares the action efficiency of each model. The UMAP projection of the trajectory steps is shown in Figures 2 and 3. The step number, underlying model, action type, and the outcome of the trajectories are highlighted.

5. Discussion

The comparison in Figure 1 shows several differences in the policies, most notably in the lengths of the trajectories and the action composition. All three models show an initial phase of exploration in the first steps (mainly composed of the ScanNetwork and FindServices actions). However, the conceptual Q-learning agent is heavily focused on the FindData action. Since searching for data in a host requires control of that host, taking this action in the first step of the game is impractical, as confirmed by the action efficiency of 15% shown in Figure 1d. The second major difference in the behavior of the conceptual agent is the significant use of the ExfiltrateData action in the later stages of the interaction.
In comparison, the LLM and Q-learning agents exfiltrate the data in very few cases, suggesting that exfiltration happens only for the correct data point required to win the game. The high action efficiency of the Q-learning agent may indicate that the model is overfitted to the particular task and network topology. This hypothesis is further supported by the low number of FindData actions, likely caused by a lack of exploration. In contrast, the LLM agent (which has no additional training for this particular task) shows more exploration (usage of ScanNetwork, FindServices, and FindData). Such behavior, while less efficient in this particular task and topology, can lead to better generalization capabilities of the policy.

The UMAP projection in Figure 2 supports the hypothesis of unnecessary use of the ExfiltrateData action by the conceptual agent, as those steps should be taken later in the interaction and lead to either a timeout or a detection ending, which is visible in the largest central cluster. The model attribution in the second subplot of Figure 2 indicates a higher similarity between the Q-learning and LLM trajectory steps despite significant model differences.

Figure 1: Figures (a), (b), and (c) show the action type distribution per step for Q-learning (a), Q-learning with concepts (b), and the LLM-based model (c) in all evaluation episodes. The height of each bar represents the number of evaluation episodes in which the corresponding step was reached; the decreasing height of the bars shows a lower occurrence of long episodes. Figure (d) shows the action efficiency per model for winning episodes only (higher is better).

Figure 3 shows the comparison of the trajectories of the Q-learning model at five distinct points of training. Since the figure depicts only one model type, it shows a lower separation of the clusters.
However, the action type subplot shows smaller peripheral clusters with higher purity, which correspond to the winning trajectories. These clusters are attributed to the policies in the later stages of the training and contain steps that occurred within the first twenty steps of the trajectories. A possible explanation is that as the model adapts to the environment, the produced trajectories have less exploration and higher similarity, which in the projection results in the smaller peripheral clusters. A notable exception is the two clusters of steps with the actions ScanNetwork and FindServices in the lower left part of the plot. The end reason subplot shows that they consist of both winning and losing trajectory steps. We can see that those steps occur at the beginning of the trajectories and for most of the models. The most likely reason is that these clusters consist of the agents' initial reconnaissance steps. Since the starting state, while randomized, is very similar in each of the trajectories, this part of the Q-table is learned very early in the training process.

Figure 2: UMAP projection of the vector representation of trajectory steps. Step number is the sequence number of the step in each trajectory. The End reason of a trajectory is assigned to all the steps in that trajectory.

6. Conclusion and Future Work

In this early-stage work, we introduce policy evaluation for a security scenario using trajectory step analysis. We propose a vector representation of the trajectories generated by the RL agent and demonstrate its application in visual explanations of trained policies. We show that the trajectory steps and the proposed vector representation can be used to find similarities in the policies of different model types. We also evaluate the method for explaining the policies during the training process. Currently, no DRL models are included in the evaluation. Additionally, a comparison with other security environments is needed.
In the project's current phase, the analysis focuses only on the individual steps of the trajectories, but such an approach might not capture all the complexities of the policy. Future steps should focus on extending this work to sequences of steps and potentially whole trajectories. The primary motivation for such an extension is to better understand and interpret the changes in the agent's behavior during training. Secondly, better clustering of trajectory steps can allow the detection of agents' intrinsic sub-goals in the trajectories, their comparison across model types, and their mapping to the existing knowledge base of attacking techniques.

Figure 3: UMAP projection of trajectories obtained from the same Q-learning agent after 5000, 10 000, 15 000, 20 000, and 25 000 training episodes.

References

[1] M. D. R. Team, CyberBattleSim, https://github.com/microsoft/cyberbattlesim, 2021. Created by Christian Seifert, Michael Betser, William Blum, James Bono, Kate Farris, Emily Goren, Justin Grana, Kristian Holsheimer, Brandon Marken, Joshua Neil, Nicole Nichols, Jugal Parikh, Haoran Wei.
[2] J. Janisch, T. Pevný, V. Lisý, NASimEmu: Network attack simulator & emulator for training agents generalizing to novel scenarios, in: European Symposium on Research in Computer Security, Springer, 2023, pp. 589–608.
[3] S. Chaudhary, A. O'Brien, S. Xu, Automated post-breach penetration testing through reinforcement learning, in: 2020 IEEE Conference on Communications and Network Security (CNS), 2020, pp. 1–2. doi:10.1109/CNS48642.2020.9162301.
[4] K. Tran, A. Akella, M. Standen, J. Kim, D. Bowman, T. Richer, C.-T. Lin, Deep hierarchical reinforcement agents for automated penetration testing, 2021. arXiv:2109.06449.
[5] Z. Hu, R. Beuran, Y. Tan, Automated penetration testing using deep reinforcement learning, in: 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), 2020, pp. 2–10. doi:10.1109/EuroSPW51379.2020.00010.
[6] K. Cobbe, O. Klimov, C. Hesse, T.
Kim, J. Schulman, Quantifying generalization in reinforcement learning, in: K. Chaudhuri, R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 1282–1289. URL: https://proceedings.mlr.press/v97/cobbe19a.html.
[7] S. Milani, N. Topin, M. Veloso, F. Fang, Explainable reinforcement learning: A survey and comparative review, ACM Comput. Surv. 56 (2024). URL: https://doi.org/10.1145/3616864. doi:10.1145/3616864.
[8] M. Rigaki, O. Lukáš, C. Catania, S. Garcia, Out of the cage: How stochastic parrots win in cyber security environments, in: Proceedings of the 16th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART, INSTICC, SciTePress, 2024, pp. 774–781. doi:10.5220/0012391800003636.
[9] T. T. Nguyen, V. J. Reddi, Deep reinforcement learning for cyber security, IEEE Transactions on Neural Networks and Learning Systems 34 (2023) 3779–3795. doi:10.1109/TNNLS.2021.3121870.
[10] G. A. Vouros, Explainable deep reinforcement learning: State of the art and challenges, ACM Comput. Surv. 55 (2022). URL: https://doi.org/10.1145/3527448. doi:10.1145/3527448.
[11] S. V. Deshmukh, A. Dasgupta, B. Krishnamurthy, N. Jiang, C. Agarwal, G. Theocharous, J. Subramanian, Explaining RL decisions with trajectories, 2024. URL: http://arxiv.org/abs/2305.04073. doi:10.48550/arXiv.2305.04073, arXiv:2305.04073 [cs].
[12] Y. Takagi, R. Tabalba, N. Kirshenbaum, J. Leigh, Abstracted trajectory visualization for explainability in reinforcement learning, 2024. arXiv:2402.07928.
[13] J. Co-Reyes, Y. Liu, A. Gupta, B. Eysenbach, P. Abbeel, S. Levine, Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings, in: J. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 1009–1018.
URL: https://proceedings.mlr.press/v80/co-reyes18a.html.
[14] D. Amir, O. Amir, HIGHLIGHTS: Summarizing agent behavior to people, in: Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, 2018, pp. 1168–1176.
[15] N. Topin, M. Veloso, Generation of policy-level explanations for reinforcement learning, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2019, pp. 2514–2521.
[16] L. McInnes, J. Healy, J. Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints (2018). arXiv:1802.03426.